Predicting Life Expectancy in a Country

Introduction

Even though there have been many studies undertaken in the past regarding different factors affecting life expectancy, none before have taken into account different inmunization effects and Human Development Index (HDI). Also, most of the research has not been done taking into account both different countries and different years (as in this one). Nevertheless, life expectancy is one of the most important indicators a country has, and therefore an indicator that tries to be improved with time through correct policies. This way, we have to ask ourselves what are the most important variables (thus, what are the most important directions public policy whould take) that helps us predict or understand the value of life expectancy in a Country.

This variable importance is straightforward clear. Hence, this gives motivation to study the relationship of this variables with some others related to economic factors, social factors, health factors inmunization factors and mortality factors. At the same time, it is important to know what variables are not so important in explaining the life expectancy in a country. This way, countries will know where not to put there resources too.

In this dataset, there is information on 193 countries that has been collected from the United Nation website. Among all categories of health-related factors only those critical factors were chosen which are more representative. It has been observed that in the past 15 years , there has been a huge development in health sector resulting in improvement of human mortality rates especially in the developing nations in comparison to the past 30 years. This is also why data has been collected from the year 2000 to 2015.

Goal

In this project I want to find the best model to predict life expectancy in a Country. The goal is actually double: I want to predict this, but I am also interested in finding the most important variables to do this prediction. This way, there may be some very complex model that will held good results predicting, but will be very difficult to interpret. This is why sometimes I will also take into account the parsimony of the model, and not only its measures.

Also, to account for how good is a model I will be taking into account the following: prediction, prediction interval, Mean Absolut Error (MAE), R squared or adjusted R squared (depending on the case, sometimes one and sometimes both) to asses how much variability will be explained by the model, and how parsimonious is the model.

In this project, I will first do data engeneering and pre-processing. Then I will try and find the best model, using the most important variables to get a better performance. Then I will combine the best models to try and get a better one. Finally, I will build prediction intervals for the best model and the best combination of models. I did not build prediction intervals for every model, as most of them are not used for predicting, but to choose variables and interactions.

###Defining Colours

color_1 <- "blue"
color_2 <- "red"
color_3 <- "yellow"

Installing Libraries

library("mice")
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library("tidyverse")
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("ggplot2")
library("ggcorrplot")
library("caret")
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library("corrplot")
## corrplot 0.84 loaded
library("FOCI")
## Loading required package: data.table
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library("Boruta")
library("GGally")
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library("MASS")
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library("olsrr")
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
## 
##     cement
## The following object is masked from 'package:datasets':
## 
##     rivers
library("psych")
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library("car")
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some

Setting seed

set.seed(45052185)

Reading Dataset

data <- read.csv("life_expect.csv")
dim(data)
## [1] 2938   22
head(data)
##       Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing            65.0             263            62
## 2 Afghanistan 2014 Developing            59.9             271            64
## 3 Afghanistan 2013 Developing            59.9             268            66
## 4 Afghanistan 2012 Developing            59.5             272            69
## 5 Afghanistan 2011 Developing            59.2             275            71
## 6 Afghanistan 2010 Developing            58.8             279            74
##   Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 1    0.01              71.279624          65    1154 19.1                83
## 2    0.01              73.523582          62     492 18.6                86
## 3    0.01              73.219243          64     430 18.1                89
## 4    0.01              78.184215          67    2787 17.6                93
## 5    0.01               7.097109          68    3013 17.2                97
## 6    0.01              79.679367          66    1989 16.7               102
##   Polio Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 1     6              8.16         65      0.1 584.25921   33736494
## 2    58              8.18         62      0.1 612.69651     327582
## 3    62              8.13         64      0.1 631.74498   31731688
## 4    67              8.52         67      0.1 669.95900    3696958
## 5    68              7.87         68      0.1  63.53723    2978599
## 6    66              9.20         66      0.1 553.32894    2883167
##   thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1                 17.2               17.3                           0.479
## 2                 17.5               17.5                           0.476
## 3                 17.7               17.7                           0.470
## 4                 17.9               18.0                           0.463
## 5                 18.2               18.2                           0.454
## 6                 18.4               18.4                           0.448
##   Schooling
## 1      10.1
## 2      10.0
## 3       9.9
## 4       9.8
## 5       9.5
## 6       9.2

As we can see, we have a total of 22 variables (columns: 21 independent features and 1 output) and 2938 rows. As my intention is to predict the life expectancy of a country, given the values of some other characteristics of the country, I will drop the variables that account for the name of the country and the year the meassures were taken. This is because, again, it will not add any important information to the analysis (maybe for other analysis, it would be useful). This will be done soon, once we have fixed the missing values.

Data Preprocessing

summary(data)
##    Country               Year         Status          Life.expectancy
##  Length:2938        Min.   :2000   Length:2938        Min.   :36.30  
##  Class :character   1st Qu.:2004   Class :character   1st Qu.:63.10  
##  Mode  :character   Median :2008   Mode  :character   Median :72.10  
##                     Mean   :2008                      Mean   :69.22  
##                     3rd Qu.:2012                      3rd Qu.:75.70  
##                     Max.   :2015                      Max.   :89.00  
##                                                       NA's   :10     
##  Adult.Mortality infant.deaths       Alcohol        percentage.expenditure
##  Min.   :  1.0   Min.   :   0.0   Min.   : 0.0100   Min.   :    0.000     
##  1st Qu.: 74.0   1st Qu.:   0.0   1st Qu.: 0.8775   1st Qu.:    4.685     
##  Median :144.0   Median :   3.0   Median : 3.7550   Median :   64.913     
##  Mean   :164.8   Mean   :  30.3   Mean   : 4.6029   Mean   :  738.251     
##  3rd Qu.:228.0   3rd Qu.:  22.0   3rd Qu.: 7.7025   3rd Qu.:  441.534     
##  Max.   :723.0   Max.   :1800.0   Max.   :17.8700   Max.   :19479.912     
##  NA's   :10                       NA's   :194                             
##   Hepatitis.B       Measles              BMI        under.five.deaths
##  Min.   : 1.00   Min.   :     0.0   Min.   : 1.00   Min.   :   0.00  
##  1st Qu.:77.00   1st Qu.:     0.0   1st Qu.:19.30   1st Qu.:   0.00  
##  Median :92.00   Median :    17.0   Median :43.50   Median :   4.00  
##  Mean   :80.94   Mean   :  2419.6   Mean   :38.32   Mean   :  42.04  
##  3rd Qu.:97.00   3rd Qu.:   360.2   3rd Qu.:56.20   3rd Qu.:  28.00  
##  Max.   :99.00   Max.   :212183.0   Max.   :87.30   Max.   :2500.00  
##  NA's   :553                        NA's   :34                       
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.370    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:78.00   1st Qu.: 4.260    1st Qu.:78.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.755    Median :93.00   Median : 0.100  
##  Mean   :82.55   Mean   : 5.938    Mean   :82.32   Mean   : 1.742  
##  3rd Qu.:97.00   3rd Qu.: 7.492    3rd Qu.:97.00   3rd Qu.: 0.800  
##  Max.   :99.00   Max.   :17.600    Max.   :99.00   Max.   :50.600  
##  NA's   :19      NA's   :226       NA's   :19                      
##       GDP              Population        thinness..1.19.years
##  Min.   :     1.68   Min.   :3.400e+01   Min.   : 0.10       
##  1st Qu.:   463.94   1st Qu.:1.958e+05   1st Qu.: 1.60       
##  Median :  1766.95   Median :1.387e+06   Median : 3.30       
##  Mean   :  7483.16   Mean   :1.275e+07   Mean   : 4.84       
##  3rd Qu.:  5910.81   3rd Qu.:7.420e+06   3rd Qu.: 7.20       
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :27.70       
##  NA's   :448         NA's   :652         NA's   :34          
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.10      Min.   :0.0000                  Min.   : 0.00  
##  1st Qu.: 1.50      1st Qu.:0.4930                  1st Qu.:10.10  
##  Median : 3.30      Median :0.6770                  Median :12.30  
##  Mean   : 4.87      Mean   :0.6276                  Mean   :11.99  
##  3rd Qu.: 7.20      3rd Qu.:0.7790                  3rd Qu.:14.30  
##  Max.   :28.60      Max.   :0.9480                  Max.   :20.70  
##  NA's   :34         NA's   :167                     NA's   :163

As we can see, we have a lot of missing value in our dataset. We get some variables, such as population with over 652 missing values! That is plenty. Also, there are some variables that have some values equal to zero. As this is a dataset taken from the WHO, we will consider that the zeros are not a mistake, but specific cases of countries where they registered that specific measure.

data$Year <- as.factor(data$Year)
sum(apply(data, 1, anyNA))
## [1] 1289

As we can see, there are 1289 rows being affected by Nans. This is a huge problem, as it is almost 30% of ours rows, meaning that it is not a good idea to drop so much information.

First, I will fill the missing values of every country, with the same information as the past (or next) year from the same country. This way, there will be consistency in eachs countries measures. This is important, as doing any other imputation with, for instance, the median or mean of the column is not a good idea. If I do this, there may be some countries that get a very ilogical value. What is more, as the countries that have many missing values are mostly very small countries (probably, the smallest in the whoele dataset) imputing in another way may cause that they will for sure get a very much larger value than they should.

Let´s fill some missing values with other values from past or forward years, for the same countries:

data <- data %>%
  dplyr::group_by(Country) %>%
  fill(Population,GDP,Life.expectancy,Schooling,Income.composition.of.resources,thinness..1.19.years,thinness.5.9.years,Total.expenditure,Hepatitis.B,BMI,Alcohol,Adult.Mortality, .direction = "downup") %>%
  dplyr::ungroup()

Let´s how many missing values are still in the dataset.

sum(apply(data, 1, anyNA))
## [1] 818

We have fixed some missing values. Nevertheless, we still have 818 rows with some of them. It is easy to tell that the biggest problems are with the variables Population and GDP. Let´s see what countries have no information whatsoever regarding their population and drop them.

bygroup <- aggregate(Population ~ Country, data=data, function(x) {sum(is.na(x))}, na.action = NULL)

Now that I grouped per countries, let´s see if there is any country that has, for example, no value at all for Population column (in any year).

no_info <- bygroup$Country[bygroup$Population ==16]
no_info
##  [1] "Antigua and Barbuda"                                 
##  [2] "Bahamas"                                             
##  [3] "Bahrain"                                             
##  [4] "Barbados"                                            
##  [5] "Bolivia (Plurinational State of)"                    
##  [6] "Brunei Darussalam"                                   
##  [7] "Congo"                                               
##  [8] "Côte d'Ivoire"                                       
##  [9] "Cuba"                                                
## [10] "Czechia"                                             
## [11] "Democratic People's Republic of Korea"               
## [12] "Democratic Republic of the Congo"                    
## [13] "Egypt"                                               
## [14] "Gambia"                                              
## [15] "Grenada"                                             
## [16] "Iran (Islamic Republic of)"                          
## [17] "Kuwait"                                              
## [18] "Kyrgyzstan"                                          
## [19] "Lao People's Democratic Republic"                    
## [20] "Libya"                                               
## [21] "Micronesia (Federated States of)"                    
## [22] "New Zealand"                                         
## [23] "Oman"                                                
## [24] "Qatar"                                               
## [25] "Republic of Korea"                                   
## [26] "Republic of Moldova"                                 
## [27] "Saint Lucia"                                         
## [28] "Saint Vincent and the Grenadines"                    
## [29] "Saudi Arabia"                                        
## [30] "Singapore"                                           
## [31] "Slovakia"                                            
## [32] "Somalia"                                             
## [33] "The former Yugoslav republic of Macedonia"           
## [34] "United Arab Emirates"                                
## [35] "United Kingdom of Great Britain and Northern Ireland"
## [36] "United Republic of Tanzania"                         
## [37] "United States of America"                            
## [38] "Venezuela (Bolivarian Republic of)"                  
## [39] "Viet Nam"                                            
## [40] "Yemen"

There are 40 countries with no value for population whtsoever. When taking a closer look, we can see that most of the missing values in our dataset come from these countries in the list. So I decide to take them out.

data <- filter(data, !Country %in% no_info)
summary(data)
##    Country               Year         Status          Life.expectancy
##  Length:2298        2013   : 153   Length:2298        Min.   :36.30  
##  Class :character   2000   : 143   Class :character   1st Qu.:62.20  
##  Mode  :character   2001   : 143   Mode  :character   Median :71.40  
##                     2002   : 143                      Mean   :68.68  
##                     2003   : 143                      3rd Qu.:75.40  
##                     2004   : 143                      Max.   :89.00  
##                     (Other):1430                      NA's   :10     
##  Adult.Mortality infant.deaths        Alcohol       percentage.expenditure
##  Min.   :  1.0   Min.   :   0.00   Min.   : 0.010   Min.   :    0.00      
##  1st Qu.: 72.0   1st Qu.:   0.00   1st Qu.: 0.680   1st Qu.:   20.63      
##  Median :147.0   Median :   3.00   Median : 3.880   Median :   93.44      
##  Mean   :170.3   Mean   :  33.91   Mean   : 4.578   Mean   :  827.36      
##  3rd Qu.:237.0   3rd Qu.:  23.00   3rd Qu.: 7.450   3rd Qu.:  490.23      
##  Max.   :723.0   Max.   :1800.00   Max.   :17.870   Max.   :19479.91      
##  NA's   :10                        NA's   :17                             
##   Hepatitis.B       Measles              BMI        under.five.deaths
##  Min.   : 2.00   Min.   :     0.0   Min.   : 1.40   Min.   :   0.00  
##  1st Qu.:64.00   1st Qu.:     0.0   1st Qu.:18.70   1st Qu.:   1.00  
##  Median :86.00   Median :    17.0   Median :42.10   Median :   4.00  
##  Mean   :73.83   Mean   :  2535.6   Mean   :37.53   Mean   :  47.14  
##  3rd Qu.:95.00   3rd Qu.:   433.2   3rd Qu.:55.80   3rd Qu.:  33.00  
##  Max.   :99.00   Max.   :212183.0   Max.   :87.30   Max.   :2500.00  
##  NA's   :128                        NA's   :34                       
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.370    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:75.00   1st Qu.: 4.343    1st Qu.:77.00   1st Qu.: 0.100  
##  Median :92.00   Median : 5.855    Median :92.00   Median : 0.100  
##  Mean   :81.21   Mean   : 5.997    Mean   :81.31   Mean   : 2.046  
##  3rd Qu.:96.00   3rd Qu.: 7.690    3rd Qu.:96.00   3rd Qu.: 1.100  
##  Max.   :99.00   Max.   :17.240    Max.   :99.00   Max.   :50.600  
##  NA's   :19                        NA's   :19                      
##       GDP              Population        thinness..1.19.years
##  Min.   :     1.68   Min.   :3.400e+01   Min.   : 0.100      
##  1st Qu.:   431.93   1st Qu.:1.966e+05   1st Qu.: 1.500      
##  Median :  1488.47   Median :1.382e+06   Median : 2.900      
##  Mean   :  6558.13   Mean   :1.273e+07   Mean   : 4.874      
##  3rd Qu.:  4974.27   3rd Qu.:7.408e+06   3rd Qu.: 7.300      
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :27.700      
##  NA's   :5           NA's   :8           NA's   :34          
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.100     Min.   :0.0000                  Min.   : 0.00  
##  1st Qu.: 1.500     1st Qu.:0.4795                  1st Qu.: 9.80  
##  Median : 3.100     Median :0.6620                  Median :12.10  
##  Mean   : 4.933     Mean   :0.6184                  Mean   :11.85  
##  3rd Qu.: 7.400     3rd Qu.:0.7665                  3rd Qu.:14.30  
##  Max.   :28.600     Max.   :0.9480                  Max.   :20.70  
##  NA's   :34         NA's   :7                       NA's   :3
dim(data)
## [1] 2298   22
sum(apply(data, 1, anyNA))
## [1] 178

As we can see, now we only have 178 missing values. We have lost some information by deleting so many rows, but we still have 2298 rows. Now we can do an imputation with the library “mice”.

data <- mice(data,m=1,method="cart")
## 
##  iter imp variable
##   1   1  Life.expectancy  Adult.Mortality  Alcohol  Hepatitis.B  BMI  Polio  Diphtheria  GDP  Population  thinness..1.19.years  thinness.5.9.years  Income.composition.of.resources  Schooling
##   2   1  Life.expectancy  Adult.Mortality  Alcohol  Hepatitis.B  BMI  Polio  Diphtheria  GDP  Population  thinness..1.19.years  thinness.5.9.years  Income.composition.of.resources  Schooling
##   3   1  Life.expectancy  Adult.Mortality  Alcohol  Hepatitis.B  BMI  Polio  Diphtheria  GDP  Population  thinness..1.19.years  thinness.5.9.years  Income.composition.of.resources  Schooling
##   4   1  Life.expectancy  Adult.Mortality  Alcohol  Hepatitis.B  BMI  Polio  Diphtheria  GDP  Population  thinness..1.19.years  thinness.5.9.years  Income.composition.of.resources  Schooling
##   5   1  Life.expectancy  Adult.Mortality  Alcohol  Hepatitis.B  BMI  Polio  Diphtheria  GDP  Population  thinness..1.19.years  thinness.5.9.years  Income.composition.of.resources  Schooling
## Warning: Number of logged events: 2
data <- complete(data)
attach(data)

Now, we don´t have any more missing values, we can keep on working with our dataset.

As explained before, let´s first drop the first and second column of the data frame.

dato <- data[,3:22]
dim(dato)
## [1] 2298   20
head(dato)
##       Status Life.expectancy Adult.Mortality infant.deaths Alcohol
## 1 Developing            65.0             263            62    0.01
## 2 Developing            59.9             271            64    0.01
## 3 Developing            59.9             268            66    0.01
## 4 Developing            59.5             272            69    0.01
## 5 Developing            59.2             275            71    0.01
## 6 Developing            58.8             279            74    0.01
##   percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths Polio
## 1              71.279624          65    1154 19.1                83     6
## 2              73.523582          62     492 18.6                86    58
## 3              73.219243          64     430 18.1                89    62
## 4              78.184215          67    2787 17.6                93    67
## 5               7.097109          68    3013 17.2                97    68
## 6              79.679367          66    1989 16.7               102    66
##   Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 1              8.16         65      0.1 584.25921   33736494
## 2              8.18         62      0.1 612.69651     327582
## 3              8.13         64      0.1 631.74498   31731688
## 4              8.52         67      0.1 669.95900    3696958
## 5              7.87         68      0.1  63.53723    2978599
## 6              9.20         66      0.1 553.32894    2883167
##   thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1                 17.2               17.3                           0.479
## 2                 17.5               17.5                           0.476
## 3                 17.7               17.7                           0.470
## 4                 17.9               18.0                           0.463
## 5                 18.2               18.2                           0.454
## 6                 18.4               18.4                           0.448
##   Schooling
## 1      10.1
## 2      10.0
## 3       9.9
## 4       9.8
## 5       9.5
## 6       9.2
summary(dato)
##     Status          Life.expectancy Adult.Mortality infant.deaths    
##  Length:2298        Min.   :36.30   Min.   :  1.0   Min.   :   0.00  
##  Class :character   1st Qu.:62.30   1st Qu.: 72.0   1st Qu.:   0.00  
##  Mode  :character   Median :71.40   Median :146.0   Median :   3.00  
##                     Mean   :68.69   Mean   :170.1   Mean   :  33.91  
##                     3rd Qu.:75.40   3rd Qu.:237.0   3rd Qu.:  23.00  
##                     Max.   :89.00   Max.   :723.0   Max.   :1800.00  
##     Alcohol       percentage.expenditure  Hepatitis.B       Measles        
##  Min.   : 0.010   Min.   :    0.00       Min.   : 2.00   Min.   :     0.0  
##  1st Qu.: 0.620   1st Qu.:   20.63       1st Qu.:63.00   1st Qu.:     0.0  
##  Median : 3.835   Median :   93.44       Median :86.00   Median :    17.0  
##  Mean   : 4.553   Mean   :  827.36       Mean   :72.67   Mean   :  2535.6  
##  3rd Qu.: 7.440   3rd Qu.:  490.23       3rd Qu.:95.00   3rd Qu.:   433.2  
##  Max.   :17.870   Max.   :19479.91       Max.   :99.00   Max.   :212183.0  
##       BMI        under.five.deaths     Polio       Total.expenditure
##  Min.   : 1.40   Min.   :   0.00   Min.   : 3.00   Min.   : 0.370   
##  1st Qu.:18.70   1st Qu.:   1.00   1st Qu.:75.00   1st Qu.: 4.343   
##  Median :41.40   Median :   4.00   Median :92.00   Median : 5.855   
##  Mean   :37.34   Mean   :  47.14   Mean   :81.22   Mean   : 5.997   
##  3rd Qu.:55.70   3rd Qu.:  33.00   3rd Qu.:96.00   3rd Qu.: 7.690   
##  Max.   :87.30   Max.   :2500.00   Max.   :99.00   Max.   :17.240   
##    Diphtheria       HIV.AIDS           GDP              Population       
##  Min.   : 2.00   Min.   : 0.100   Min.   :     1.68   Min.   :3.400e+01  
##  1st Qu.:77.00   1st Qu.: 0.100   1st Qu.:   430.27   1st Qu.:1.966e+05  
##  Median :92.00   Median : 0.100   Median :  1483.93   Median :1.371e+06  
##  Mean   :81.27   Mean   : 2.046   Mean   :  6548.41   Mean   :1.269e+07  
##  3rd Qu.:96.00   3rd Qu.: 1.100   3rd Qu.:  4972.20   3rd Qu.:7.388e+06  
##  Max.   :99.00   Max.   :50.600   Max.   :119172.74   Max.   :1.294e+09  
##  thinness..1.19.years thinness.5.9.years Income.composition.of.resources
##  Min.   : 0.100       Min.   : 0.100     Min.   :0.0000                 
##  1st Qu.: 1.500       1st Qu.: 1.500     1st Qu.:0.4790                 
##  Median : 3.000       Median : 3.100     Median :0.6620                 
##  Mean   : 4.895       Mean   : 4.963     Mean   :0.6173                 
##  3rd Qu.: 7.400       3rd Qu.: 7.400     3rd Qu.:0.7660                 
##  Max.   :27.700       Max.   :28.600     Max.   :0.9480                 
##    Schooling    
##  Min.   : 0.00  
##  1st Qu.: 9.80  
##  Median :12.10  
##  Mean   :11.84  
##  3rd Qu.:14.30  
##  Max.   :20.70

As we can see I have now 19 predictors, from which 18 are continuous and only 1 is categorical. Here is an explanation of them:

  1. Status: categorical variable (either developed or developing)
  2. Life Expectancy: target variable (continuous)
  3. Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
  4. Infant Deaths: Number of Infant Deaths per 1000 population
  5. Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
  6. Percentage Expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita(%)
  7. Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
  8. Measles: Measles - number of reported cases per 1000 population
  9. BMI: Average Body Mass Index of entire population
  10. Under 5 deaths: Number of under-five deaths per 1000 population
  11. Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
  12. Total Expenditure: General government expenditure on health as a percentage of total government expenditure (%)
  13. Diphteria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
  14. HIV: Deaths per 1 000 live births HIV/AIDS (0-4 years)
  15. GDP:Gross Domestic Product per capita (in USD)
  16. Population: Population of the country
  17. Thinness 5-9 years: Prevalence of thinness among children for Age 5 to 9(%)
  18. Thinness 10-19: Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
  19. Income Composition of Resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1).
  20. Schooling: Number of years of Schooling, in years. .

Target Variable: Life Expectancy

As said before, my purpose is to predict the life expectancy of a country. Let´s first explore this variable.

describe(dato$Life.expectancy)
##    vars    n  mean  sd median trimmed  mad  min max range  skew kurtosis  se
## X1    1 2298 68.69 9.8   71.4   69.28 9.49 36.3  89  52.7 -0.53    -0.37 0.2

We can first see that from all our data set the minimum value of life expectancy is 36.30, while the maximum is 89. In average, the countries through the years have a life expectancy of 68.69. About the variability, the standrd deviation is 9.8. Its kurtosis is -0.37. This negative value means our variable has a “platykurtic” distribution, which means that it has a flatter peak and thinner tails compared to a normal distribution. This simply means that more data values are located near the mean and less data values are located on the tails.

Let´s plot the target variable.

ggplot(dato, aes(x=Life.expectancy)) + 
 geom_histogram(aes(y=..density..), colour=color_2, fill=color_3)+
 geom_density(alpha=0.3, fill=color_1)+
labs(x = "Life Expectancy", y = "Density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here we can see that our target variable has a distribution that can be related to a normal, but as said before with more values close to the mean and not so many in the tails (maybe a gamma distribution).

Features (predictors)

features <- setdiff(colnames(dato),'data')
numeric_features <- setdiff(colnames(dato[,2:20]),'dato')

Let´s first plot an histogram to see how their univariate distribution behaves:

for (f in numeric_features) {
  hist(dato[,f], main=f)
}

As we have only one categorical variable (Status) let´s see what is its relationship with our target variable.

ggplot(dato, aes(x=Status, y=Life.expectancy, fill=Status)) + 
    geom_boxplot()

As it can be seen in the plot, developed country have a larger life expectancy median, and its distribution is much more concentrated. Meaning that there is no so much variability between developed countries. It seems that the median is close to 80. On the other hand, developing countries have a less concentrated life expectancy distribution (the box is larger), having a median lower than 70. This way, we can see that there is a huge difference between developed and developing countries. This should not surprise us, as life expectancy is one of the variables to look at when assesing if a country is (or is not) a developed country.

As I will be using CARET to model and predict my target variable, there is no need of standarization or scaling now, as the package can do it when running the algorithm.

Let´s start to asses the relationship between the variables in the dataset.

First, let´s check the autocorrelation between the variables.

correlation <- corrplot(cor(dato[,2:20]),type="upper",method="circle",title="Correlation plot",mar=c(0.1,0.1,0.1,0.1))

As we can see, there are some variables that have a strong correlation. For exapmle, there is a perfect correlation between “under five year death” and “infant mortality”. This is because they measure almost the same, but with a slight different age interval. There is also a strong correlation between “GDP” and the “percentage of expenditure”. This is also logical: the richer the country, the bigger the percentage of GDP destined to the health sector.

Let´s plot “infants deaths” vs. “under five deaths”:

ggplot(dato, aes(x=under.five.deaths, y=infant.deaths)) + 
    geom_point(size=2, shape=23)+ geom_abline(intercept = 0, slope = 1, colour = "blue") 

Let´s see what happens when trying to calculate the VIF of a multiple regression when taking into account all the covariates (to check if the two variables will be producing multicolinearity). Here, I will perform a multiple regression but without splitting the data into training and testing set, as I do not want to predict yet but to asses for multicolinearity.

multiple.lm <-  lm(Life.expectancy ~., data=dato)
vif(multiple.lm)
##                          Status                 Adult.Mortality 
##                        1.995091                        1.762713 
##                   infant.deaths                         Alcohol 
##                      178.956741                        2.151705 
##          percentage.expenditure                     Hepatitis.B 
##                        8.516394                        1.477716 
##                         Measles                             BMI 
##                        1.422992                        1.858091 
##               under.five.deaths                           Polio 
##                      178.037256                        1.932098 
##               Total.expenditure                      Diphtheria 
##                        1.174575                        2.267897 
##                        HIV.AIDS                             GDP 
##                        1.453274                        9.241228 
##                      Population            thinness..1.19.years 
##                        1.494111                        7.866809 
##              thinness.5.9.years Income.composition.of.resources 
##                        7.957444                        3.690749 
##                       Schooling 
##                        4.137874

As we can see, the VIF of “under five year death” and “infant mortality” are extremely high, meaning that they probably meassure almost the same thing. All together, I will drop the “under five year death” as between both of them is the one that has the less correlation with the target variable and this way, we will be avoiding multicolinearity.

datos <- dato[,-10]
setdiff(colnames(datos),'data')
##  [1] "Status"                          "Life.expectancy"                
##  [3] "Adult.Mortality"                 "infant.deaths"                  
##  [5] "Alcohol"                         "percentage.expenditure"         
##  [7] "Hepatitis.B"                     "Measles"                        
##  [9] "BMI"                             "Polio"                          
## [11] "Total.expenditure"               "Diphtheria"                     
## [13] "HIV.AIDS"                        "GDP"                            
## [15] "Population"                      "thinness..1.19.years"           
## [17] "thinness.5.9.years"              "Income.composition.of.resources"
## [19] "Schooling"

Checking linear dependencies

X_matr <- data.matrix(datos)
comboInfo <- findLinearCombos(X_matr)
comboInfo
## $linearCombos
## list()
## 
## $remove
## NULL

There are no columns that can be expressed as linear combination of other columns. This is fundamental to later have no problems in inverting the variance-covariance matrix when performing regression.

Modeling and Prediction

First, let´s divide the training and testing set with caret.

in_train <- createDataPartition((datos$Life.expectancy), p = 0.8, list = FALSE)  # 80% for training
training <- datos[ in_train,]
testing <- datos[-in_train,]
nrow(training)
## [1] 1841
nrow(testing)
## [1] 457
control <- trainControl(method="cv", number=10)  

Let´s explore the traning and testing set to be sure there is no huge bias in any of them.

ggplot(training, aes(x=Status, y=Life.expectancy, fill=Status)) + 
    geom_boxplot()

ggplot(testing, aes(x=Status, y=Life.expectancy, fill=Status)) + 
    geom_boxplot()

As far as we can tell from these boxplots, the training and testing set look well balanced: among developed and developing countries and life expectancy distributions.

Let´s check what are the most correlated variables with “life expectancy”:

corr_life <- sort(cor(training[,2:19])["Life.expectancy",], decreasing = T)
corr=data.frame(corr_life)
ggplot(corr,aes(x = row.names(corr), y = corr_life)) + 
  geom_bar(stat = "identity", fill = "lightblue") + 
  scale_x_discrete(limits= row.names(corr)) +
  labs(x = "Predictors", y = "Life Expectancy", title = "Correlations") + 
  theme(plot.title = element_text(hjust = 0, size = rel(1.5)),
        axis.text.x = element_text(angle = 45, hjust = 1))

As we can see, the variable that has the strongest correlation is Schooling, income composition of resources, followed by BMI, and GDP. On the other hand, there are some variables that hold a strong negative correlation with our target, such as Adult mortality (logicaly), HIV (also logicaly) and other variables that are defined by illnesses that may cause death. The only variable that apparently has no correlation with life expectancy is “Population”. Nevertheless, correlation does not necesarily imply causality, so we can not yet asses which of these predictors helps us best to predict life expectancy in a country.

Benchmark

For this project, I will use a benchmark in order to know how well or bad am I doing when bilding my predictive models. In this case, I will use a multiple regression with the three more correlated variables to the target: “schooling”, “income composition of resources” and “adult mortality”.

benchFit <- lm(Life.expectancy ~ Schooling + Income.composition.of.resources + Adult.Mortality, data=training)
predictions <- predict(benchFit, newdata=testing)
cor((testing$Life.expectancy), predictions)^2
## [1] 0.7302342
MAE <- mean(abs(predictions - testing$Life.expectancy))
MAE
## [1] 3.49089

As we can see, for our benchmark we get an R-square equal to 0.7734, meaning that we can explain 77,34% of the variability of the dataset, and our mean absolute error is of 3.2528.

Feature Selection

Let´s first see what happens when we perform a multiple linear regression, taking into account all the covariates we have, and see what is the importance of each variable.

fit.lm <- lm(Life.expectancy ~., data=training)
summary(fit.lm)
## 
## Call:
## lm(formula = Life.expectancy ~ ., data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.5179  -2.3224   0.0532   2.3488  19.2881 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.241e+01  7.605e-01  68.913  < 2e-16 ***
## StatusDeveloping                -1.365e+00  3.471e-01  -3.932 8.74e-05 ***
## Adult.Mortality                 -1.522e-02  9.786e-04 -15.550  < 2e-16 ***
## infant.deaths                    2.922e-04  1.113e-03   0.262 0.793007    
## Alcohol                         -1.688e-01  3.332e-02  -5.065 4.50e-07 ***
## percentage.expenditure           1.045e-04  1.208e-04   0.865 0.387160    
## Hepatitis.B                     -6.518e-03  3.817e-03  -1.707 0.087906 .  
## Measles                         -1.618e-05  1.045e-05  -1.548 0.121902    
## BMI                              4.931e-02  6.441e-03   7.656 3.10e-14 ***
## Polio                            2.069e-02  5.544e-03   3.731 0.000196 ***
## Total.expenditure                6.224e-02  4.227e-02   1.473 0.141053    
## Diphtheria                       3.066e-02  6.037e-03   5.079 4.18e-07 ***
## HIV.AIDS                        -4.840e-01  2.030e-02 -23.844  < 2e-16 ***
## GDP                              2.636e-05  2.025e-05   1.302 0.193091    
## Population                       2.899e-10  1.846e-09   0.157 0.875229    
## thinness..1.19.years            -9.903e-02  5.777e-02  -1.714 0.086666 .  
## thinness.5.9.years               2.758e-02  5.644e-02   0.489 0.625167    
## Income.composition.of.resources  1.009e+01  8.516e-01  11.845  < 2e-16 ***
## Schooling                        8.173e-01  5.524e-02  14.797  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.042 on 1822 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8314 
## F-statistic:   505 on 18 and 1822 DF,  p-value: < 2.2e-16
lm_imp <- varImp(fit.lm, scale = F)
lm_imp
##                                    Overall
## StatusDeveloping                 3.9320310
## Adult.Mortality                 15.5498067
## infant.deaths                    0.2624459
## Alcohol                          5.0650736
## percentage.expenditure           0.8649873
## Hepatitis.B                      1.7074671
## Measles                          1.5475578
## BMI                              7.6555153
## Polio                            3.7312879
## Total.expenditure                1.4725201
## Diphtheria                       5.0791626
## HIV.AIDS                        23.8437662
## GDP                              1.3019700
## Population                       0.1570423
## thinness..1.19.years             1.7141778
## thinness.5.9.years               0.4886230
## Income.composition.of.resources 11.8450486
## Schooling                       14.7967152

According to this simple approach, the most important variable would be “HIV”, followed by “Adult Mortality”, “Schooling” and “Income Composition of Resources”. This is interesting as these were the variables that showed a strong correlation with the target variable. This way, we are accounting for the largest t-values in the regression. Let´s see what happens when we are interested in lowest p-values instead:

summary(fit.lm)
## 
## Call:
## lm(formula = Life.expectancy ~ ., data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.5179  -2.3224   0.0532   2.3488  19.2881 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.241e+01  7.605e-01  68.913  < 2e-16 ***
## StatusDeveloping                -1.365e+00  3.471e-01  -3.932 8.74e-05 ***
## Adult.Mortality                 -1.522e-02  9.786e-04 -15.550  < 2e-16 ***
## infant.deaths                    2.922e-04  1.113e-03   0.262 0.793007    
## Alcohol                         -1.688e-01  3.332e-02  -5.065 4.50e-07 ***
## percentage.expenditure           1.045e-04  1.208e-04   0.865 0.387160    
## Hepatitis.B                     -6.518e-03  3.817e-03  -1.707 0.087906 .  
## Measles                         -1.618e-05  1.045e-05  -1.548 0.121902    
## BMI                              4.931e-02  6.441e-03   7.656 3.10e-14 ***
## Polio                            2.069e-02  5.544e-03   3.731 0.000196 ***
## Total.expenditure                6.224e-02  4.227e-02   1.473 0.141053    
## Diphtheria                       3.066e-02  6.037e-03   5.079 4.18e-07 ***
## HIV.AIDS                        -4.840e-01  2.030e-02 -23.844  < 2e-16 ***
## GDP                              2.636e-05  2.025e-05   1.302 0.193091    
## Population                       2.899e-10  1.846e-09   0.157 0.875229    
## thinness..1.19.years            -9.903e-02  5.777e-02  -1.714 0.086666 .  
## thinness.5.9.years               2.758e-02  5.644e-02   0.489 0.625167    
## Income.composition.of.resources  1.009e+01  8.516e-01  11.845  < 2e-16 ***
## Schooling                        8.173e-01  5.524e-02  14.797  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.042 on 1822 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8314 
## F-statistic:   505 on 18 and 1822 DF,  p-value: < 2.2e-16

As we can see, through this approach we would say that the most important variables would be the “Development Status”, “Adult Mortality”, “Infant Deaths”, Percentage Expenditure“,”BMI“,”Polio“,”Diphteria“,”HIV“,”Income Composition of resources" and “Schooling”. Between these, the lower p-values are of “Income Composition of resources”, “Schooling”, “HIV”, “BMI” and “Adult Mortality”. Results are similar to the ones obtained before, although there are some differences. Let´s take into account that our adjusted R squared is 0.837, so we are explaining big part of the variability in the dataset.

Let´s check the dignostic plots to asses if the assumptions of the model are being violated or not.

plot(fit.lm)

As we can see in the plots, the residuals seem to be not very well centered around zero, and there is clearly some heterocedasticity (non constant variance). We can say that the lineality assumption is kind of violated here. For the most part, they seem to behave as gaussian (more or less) and there seems to be some possible outliers (although there are no points outside the red line, on the corners of the forth plot, there are some points that are close it).

prLM <- predict(fit.lm, newdata=testing)
cor((testing$Life.expectancy), prLM)^2
## [1] 0.8364015
mean(abs((testing$Life.expectancy)- prLM))
## [1] 2.968201

As we can see, when taking into account all the covariates we get a better result than our benchmark. Now, we got a better R-squared, at 0.8132 and our mean absolute error has decreased from 3.2528 to 3.14. This is a good start. But from now on, I will also start paying attention to the adjusted R-squared, as I would like to penalize for the amount of variables used.

I will also run this regression through Caret Package, to be able to do a nicer comparison in the future between all predictive models.

lm1 <- train(Life.expectancy ~ .,data=training, method='lm',
                preProcess=c("scale","center"), trControl=control) 
test_results <- data.frame(Life.expectancy = (testing$Life.expectancy))
test_results$lm1 <- predict(lm1, testing)
postResample(pred = test_results$lm1,  obs = test_results$Life.expectancy)
##      RMSE  Rsquared       MAE 
## 3.9043953 0.8364015 2.9682015

With CARET we have an even better model to predict, as it allows us to do some automatic transformations to the data: now by scaling and centering it. Let´s see a plot of the observed vs. the predicted values of the life expectancy in the testing set:

qplot(test_results$lm1, test_results$Life.expectancy) + 
  labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed")  +
  geom_abline(intercept = 0, slope = 1, colour = "blue") +
  theme_bw()

Here we can see that there is some noise in the predictions. Even though the results seem to be fitting a linear tendency, there are some big residuals, specially in the corner.

Let´s see what happens when we try again a regression with all the covariates, but now using a general linear model, accounting for the skewness in our target variable: we will consider ir a gamma. This will give us a hint on how to deal with our target variable.

lm2 <- train(Life.expectancy ~ .,data=training, method='glm',family="Gamma",
                preProcess=c("scale","center"), trControl=control) 
test_results <- data.frame(Life.expectancy = (testing$Life.expectancy))
test_results$lm2 <- predict(lm2, testing)
postResample(pred = test_results$lm2,  obs = test_results$Life.expectancy)
##      RMSE  Rsquared       MAE 
## 3.8984987 0.8369797 2.9420461

When using a gamma family for the glm, we get even a better result than when assuming a gaussian distribution for our target variable. This way, we get an R squared equal to 0.8405 (higher than 0.8384 obtained with the linear model before), a mean absolute error equal to 2.9160 (lower than the 2.9382 from before) and also a lower RMSE. Until now, this is our best model.

Robust Regression

Now, I will perform two different robust regressions, with all the covariates I have. Robust regression is a special kind of linear regression, where we give different weights to each point, assigned by a specific weighting function. I will use a Huber and a Bisquare to account for possible outliers:

linfit.huber <- rlm(Life.expectancy ~ . , data=training)
summary(linfit.huber)
## 
## Call: rlm(formula = Life.expectancy ~ ., data = training)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -20.71981  -2.25319   0.02124   2.25871  20.34708 
## 
## Coefficients:
##                                 Value    Std. Error t value 
## (Intercept)                      52.2654   0.6743    77.5125
## StatusDeveloping                 -0.9857   0.3078    -3.2028
## Adult.Mortality                  -0.0167   0.0009   -19.2293
## infant.deaths                     0.0006   0.0010     0.5610
## Alcohol                          -0.1559   0.0295    -5.2773
## percentage.expenditure            0.0001   0.0001     1.1980
## Hepatitis.B                      -0.0020   0.0034    -0.5771
## Measles                           0.0000   0.0000    -1.2462
## BMI                               0.0338   0.0057     5.9199
## Polio                             0.0165   0.0049     3.3624
## Total.expenditure                 0.0755   0.0375     2.0154
## Diphtheria                        0.0280   0.0054     5.2309
## HIV.AIDS                         -0.4744   0.0180   -26.3588
## GDP                               0.0000   0.0000     0.8903
## Population                        0.0000   0.0000    -0.2124
## thinness..1.19.years             -0.0880   0.0512    -1.7175
## thinness.5.9.years                0.0112   0.0500     0.2231
## Income.composition.of.resources  12.5198   0.7550    16.5831
## Schooling                         0.7576   0.0490    15.4696
## 
## Residual standard error: 3.342 on 1822 degrees of freedom
linfit.bisquare <- rlm(Life.expectancy ~ . , data=training, psi = psi.bisquare)
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
summary(linfit.bisquare)
## 
## Call: rlm(formula = Life.expectancy ~ ., data = training, psi = psi.bisquare)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -20.43139  -2.21021   0.01212   2.23254  21.25097 
## 
## Coefficients:
##                                 Value    Std. Error t value 
## (Intercept)                      51.5547   0.6834    75.4417
## StatusDeveloping                 -0.9119   0.3119    -2.9235
## Adult.Mortality                  -0.0158   0.0009   -17.9907
## infant.deaths                     0.0005   0.0010     0.5408
## Alcohol                          -0.1567   0.0299    -5.2347
## percentage.expenditure            0.0001   0.0001     1.2845
## Hepatitis.B                      -0.0003   0.0034    -0.0939
## Measles                           0.0000   0.0000    -0.8400
## BMI                               0.0285   0.0058     4.9282
## Polio                             0.0142   0.0050     2.8481
## Total.expenditure                 0.0868   0.0380     2.2847
## Diphtheria                        0.0272   0.0054     5.0110
## HIV.AIDS                         -0.4878   0.0182   -26.7448
## GDP                               0.0000   0.0000     0.7100
## Population                        0.0000   0.0000    -0.3310
## thinness..1.19.years             -0.0738   0.0519    -1.4225
## thinness.5.9.years                0.0026   0.0507     0.0520
## Income.composition.of.resources  14.2728   0.7652    18.6535
## Schooling                         0.7287   0.0496    14.6830
## 
## Residual standard error: 3.291 on 1822 degrees of freedom

Let´s see how do they predict.

prHuber <- predict(linfit.huber, newdata=testing)
cor((testing$Life.expectancy), prHuber)^2
## [1] 0.8309922
mean(abs((testing$Life.expectancy)- prHuber))
## [1] 2.945815
prBi <- predict(linfit.bisquare, newdata=testing)
cor((testing$Life.expectancy), prBi)^2
## [1] 0.825321
mean(abs((testing$Life.expectancy)- prBi))
## [1] 2.96449

Here we get an interesting result. Our R sqaured are lower in comparison to what we got in our previous models, but we predict better in our testing set with both of them. Between them two, apparently the best model to predict is the one with huber function, although they both have very similar results.

Most important Variables

Let´s try and get a more parsimonious model, when trying also to asses what is maybe the best model given our goal.

First, let´s do a forward based ordinary least of squares regression:

forwardpv <- ols_step_forward_p(fit.lm) # forward based on p-value
plot(forwardpv)

forwardpv$model
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Coefficients:
##                     (Intercept)  Income.composition.of.resources  
##                       5.244e+01                        1.011e+01  
##                       Schooling                         HIV.AIDS  
##                       8.144e-01                       -4.844e-01  
##                 Adult.Mortality                              BMI  
##                      -1.522e-02                        4.903e-02  
##                      Diphtheria                              GDP  
##                       3.089e-02                        4.202e-05  
##                           Polio                          Alcohol  
##                       2.059e-02                       -1.679e-01  
##                StatusDeveloping             thinness..1.19.years  
##                      -1.368e+00                       -6.875e-02  
##                     Hepatitis.B                          Measles  
##                      -7.057e-03                       -1.422e-05  
##               Total.expenditure  
##                       6.345e-02

As we can see, there are 15 variables taken here into account. Let´s see how it predictss:

leap1 <- predict(forwardpv$model, newdata=testing)
cor((testing$Life.expectancy), leap1)^2
## [1] 0.8365429
mean(abs((testing$Life.expectancy)- leap1))
## [1] 2.969078

Results are not better than the ones we already got, although they are close and with less variables used (has the highest R-squared yet but not the lowest MAE).

Let´s now try using the AKAIKE criteria.

First, let´s do forward based elimination, guiding by the AIC criterion:

forwaic <- ols_step_forward_aic(fit.lm) # forward based on AIC
forwaic$model
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Coefficients:
##                     (Intercept)                        Schooling  
##                       5.244e+01                        8.144e-01  
##                        HIV.AIDS                  Adult.Mortality  
##                      -4.844e-01                       -1.522e-02  
## Income.composition.of.resources                              BMI  
##                       1.011e+01                        4.903e-02  
##                      Diphtheria                              GDP  
##                       3.089e-02                        4.202e-05  
##                           Polio                          Alcohol  
##                       2.059e-02                       -1.679e-01  
##                StatusDeveloping             thinness..1.19.years  
##                      -1.368e+00                       -6.875e-02  
##                     Hepatitis.B                          Measles  
##                      -7.057e-03                       -1.422e-05  
##               Total.expenditure  
##                       6.345e-02
forwpred <- predict(forwaic$model, newdata=testing)
cor((testing$Life.expectancy), forwpred)^2
## [1] 0.8365429
mean(abs((testing$Life.expectancy)- forwpred))
## [1] 2.969078

Now, let´s do the same but with backward elimination, also guiding by the AIC criterion:

backaic  <- ols_step_backward_aic(fit.lm) # backward AIC
backaic$model
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Coefficients:
##                     (Intercept)                 StatusDeveloping  
##                       5.244e+01                       -1.368e+00  
##                 Adult.Mortality                          Alcohol  
##                      -1.522e-02                       -1.679e-01  
##                     Hepatitis.B                          Measles  
##                      -7.057e-03                       -1.422e-05  
##                             BMI                            Polio  
##                       4.903e-02                        2.059e-02  
##               Total.expenditure                       Diphtheria  
##                       6.345e-02                        3.089e-02  
##                        HIV.AIDS                              GDP  
##                      -4.844e-01                        4.202e-05  
##            thinness..1.19.years  Income.composition.of.resources  
##                      -6.875e-02                        1.011e+01  
##                       Schooling  
##                       8.144e-01
backpred <- predict(backaic$model, newdata=testing)
cor((testing$Life.expectancy), backpred)^2
## [1] 0.8365429
mean(abs((testing$Life.expectancy)- backpred))
## [1] 2.969078

As I get both with the backward and forward process of elimination the same result, there is no point in doing the bothsides step akaike elimination. In both cases, I get to a more parsimonious model in comparison with the one of full covariates, but we do not beat the previous models. Here we get an r-squared of 0.8381 and a Mean Absolute Error of 2.94.

Nevertheless, I am interested in finding what happens when the same is done, but accounting for interactions. Now it is highly likely we will get a huge model with many terms, as I will try every possible two-sided interaction (again forward, backwards and both ways, always guided by AIC criterion).

First forward:

forwint <- step(fit.lm, scope = . ~ .^2, direction = 'forward')
summary(forwint)
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV.AIDS + 
##     GDP + Population + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS + 
##     BMI:Diphtheria + thinness..1.19.years:thinness.5.9.years + 
##     percentage.expenditure:Income.composition.of.resources + 
##     Income.composition.of.resources:Schooling + BMI:Schooling + 
##     Adult.Mortality:Schooling + HIV.AIDS:Income.composition.of.resources + 
##     Total.expenditure:thinness..1.19.years + Alcohol:thinness.5.9.years + 
##     Status:Schooling + Status:Income.composition.of.resources + 
##     Status:Adult.Mortality + Measles:HIV.AIDS + Alcohol:HIV.AIDS + 
##     Alcohol:GDP + Adult.Mortality:Diphtheria + Adult.Mortality:Alcohol + 
##     Polio:Schooling + Diphtheria:Income.composition.of.resources + 
##     Status:BMI + Hepatitis.B:Total.expenditure + Total.expenditure:HIV.AIDS + 
##     Adult.Mortality:Total.expenditure + percentage.expenditure:HIV.AIDS + 
##     Alcohol:Total.expenditure + percentage.expenditure:thinness.5.9.years + 
##     Status:Alcohol + Status:thinness..1.19.years + Alcohol:Schooling + 
##     Status:GDP + percentage.expenditure:GDP + Diphtheria:Schooling + 
##     HIV.AIDS:Schooling + HIV.AIDS:thinness..1.19.years + Measles:BMI + 
##     Status:infant.deaths + Measles:Polio + infant.deaths:Total.expenditure + 
##     Total.expenditure:Population + Alcohol:Population + Total.expenditure:Diphtheria + 
##     Alcohol:Polio + Hepatitis.B:thinness..1.19.years + Alcohol:Income.composition.of.resources + 
##     Diphtheria:thinness.5.9.years + Hepatitis.B:thinness.5.9.years + 
##     BMI:Polio + infant.deaths:Hepatitis.B + infant.deaths:Population + 
##     Total.expenditure:Income.composition.of.resources + BMI:GDP + 
##     Status:Total.expenditure + Hepatitis.B:Diphtheria + Measles:Diphtheria + 
##     infant.deaths:Diphtheria + infant.deaths:Polio + infant.deaths:Income.composition.of.resources + 
##     Population:Income.composition.of.resources + percentage.expenditure:Population + 
##     Hepatitis.B:Population + Polio:GDP + Hepatitis.B:Polio + 
##     Total.expenditure:GDP + Adult.Mortality:percentage.expenditure + 
##     Adult.Mortality:GDP + Status:Diphtheria + Population:thinness.5.9.years, 
##     data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2852  -1.6312  -0.0146   1.6706  11.1827 
## 
## Coefficients:
##                                                          Estimate Std. Error
## (Intercept)                                             2.402e+01  6.334e+00
## StatusDeveloping                                        1.755e+01  5.873e+00
## Adult.Mortality                                         3.489e-02  6.734e-03
## infant.deaths                                           4.168e-01  1.897e-01
## Alcohol                                                -1.139e-01  2.535e-01
## percentage.expenditure                                  3.731e-03  1.333e-03
## Hepatitis.B                                            -3.852e-02  1.361e-02
## Measles                                                -2.416e-06  4.179e-05
## BMI                                                     3.898e-01  3.657e-02
## Polio                                                   7.262e-02  1.843e-02
## Total.expenditure                                       8.310e-02  2.225e-01
## Diphtheria                                              1.328e-01  3.450e-02
## HIV.AIDS                                               -1.308e+00  1.475e-01
## GDP                                                     1.710e-04  8.230e-05
## Population                                              3.378e-08  2.112e-08
## thinness..1.19.years                                   -1.610e+00  3.927e-01
## thinness.5.9.years                                      3.266e-01  1.381e-01
## Income.composition.of.resources                         3.351e+01  7.853e+00
## Schooling                                               3.414e-01  3.061e-01
## Adult.Mortality:HIV.AIDS                                7.948e-04  7.060e-05
## BMI:Diphtheria                                         -1.797e-03  3.310e-04
## thinness..1.19.years:thinness.5.9.years                 1.900e-02  3.369e-03
## percentage.expenditure:Income.composition.of.resources -5.046e-03  1.450e-03
## Income.composition.of.resources:Schooling               8.378e-01  1.611e-01
## BMI:Schooling                                          -1.681e-02  1.894e-03
## Adult.Mortality:Schooling                              -1.022e-03  3.551e-04
## HIV.AIDS:Income.composition.of.resources                7.113e-01  3.900e-01
## Total.expenditure:thinness..1.19.years                 -5.840e-02  1.057e-02
## Alcohol:thinness.5.9.years                             -4.546e-02  1.082e-02
## StatusDeveloping:Schooling                              1.362e+00  2.243e-01
## StatusDeveloping:Income.composition.of.resources       -4.361e+01  6.995e+00
## StatusDeveloping:Adult.Mortality                       -2.643e-02  5.222e-03
## Measles:HIV.AIDS                                       -9.086e-06  2.848e-06
## Alcohol:HIV.AIDS                                        6.003e-02  1.036e-02
## Alcohol:GDP                                            -6.977e-06  2.758e-06
## Adult.Mortality:Diphtheria                             -8.480e-05  3.006e-05
## Adult.Mortality:Alcohol                                -1.007e-03  2.841e-04
## Polio:Schooling                                        -6.452e-03  1.880e-03
## Diphtheria:Income.composition.of.resources              1.096e-01  2.366e-02
## StatusDeveloping:BMI                                   -4.839e-02  1.587e-02
## Hepatitis.B:Total.expenditure                           4.738e-03  1.229e-03
## Total.expenditure:HIV.AIDS                              6.310e-02  1.128e-02
## Adult.Mortality:Total.expenditure                      -1.044e-03  3.429e-04
## percentage.expenditure:HIV.AIDS                        -2.751e-04  8.857e-05
## Alcohol:Total.expenditure                              -4.295e-02  1.159e-02
## percentage.expenditure:thinness.5.9.years               2.773e-04  6.977e-05
## StatusDeveloping:Alcohol                                2.027e-01  8.642e-02
## StatusDeveloping:thinness..1.19.years                   1.318e+00  3.736e-01
## Alcohol:Schooling                                       5.856e-02  1.878e-02
## StatusDeveloping:GDP                                    7.194e-05  2.277e-05
## percentage.expenditure:GDP                              6.251e-09  1.782e-09
## Diphtheria:Schooling                                   -3.624e-03  2.131e-03
## HIV.AIDS:Schooling                                     -4.388e-02  1.711e-02
## HIV.AIDS:thinness..1.19.years                           7.464e-03  4.344e-03
## Measles:BMI                                            -3.290e-06  1.193e-06
## StatusDeveloping:infant.deaths                         -4.150e-01  1.897e-01
## Measles:Polio                                           4.001e-06  1.193e-06
## infant.deaths:Total.expenditure                        -3.527e-03  1.217e-03
## Total.expenditure:Population                            6.338e-09  1.675e-09
## Alcohol:Population                                     -2.558e-09  1.007e-09
## Total.expenditure:Diphtheria                           -4.879e-03  1.830e-03
## Alcohol:Polio                                           3.063e-03  1.341e-03
## Hepatitis.B:thinness..1.19.years                        4.992e-03  1.725e-03
## Alcohol:Income.composition.of.resources                -8.690e-01  3.341e-01
## Diphtheria:thinness.5.9.years                          -2.659e-03  1.149e-03
## Hepatitis.B:thinness.5.9.years                         -2.453e-03  1.659e-03
## BMI:Polio                                               4.762e-04  2.700e-04
## infant.deaths:Hepatitis.B                              -8.572e-05  3.370e-05
## infant.deaths:Population                                8.887e-12  1.179e-11
## Total.expenditure:Income.composition.of.resources       4.240e-01  2.235e-01
## BMI:GDP                                                 6.358e-07  4.127e-07
## StatusDeveloping:Total.expenditure                      2.105e-01  1.134e-01
## Hepatitis.B:Diphtheria                                  2.964e-04  1.376e-04
## Measles:Diphtheria                                     -2.473e-06  1.092e-06
## infant.deaths:Diphtheria                                1.204e-04  5.260e-05
## infant.deaths:Polio                                    -2.304e-04  7.363e-05
## infant.deaths:Income.composition.of.resources           3.547e-02  1.214e-02
## Population:Income.composition.of.resources             -1.051e-07  3.605e-08
## percentage.expenditure:Population                       6.742e-12  2.948e-12
## Hepatitis.B:Population                                  2.069e-10  9.662e-11
## Polio:GDP                                              -1.807e-06  7.306e-07
## Hepatitis.B:Polio                                      -2.432e-04  1.449e-04
## Total.expenditure:GDP                                   4.349e-06  2.568e-06
## Adult.Mortality:percentage.expenditure                  6.031e-06  2.337e-06
## Adult.Mortality:GDP                                    -6.573e-07  3.163e-07
## StatusDeveloping:Diphtheria                            -3.646e-02  2.235e-02
## Population:thinness.5.9.years                          -6.658e-10  4.591e-10
##                                                        t value Pr(>|t|)    
## (Intercept)                                              3.792 0.000154 ***
## StatusDeveloping                                         2.988 0.002849 ** 
## Adult.Mortality                                          5.182 2.45e-07 ***
## infant.deaths                                            2.197 0.028125 *  
## Alcohol                                                 -0.449 0.653196    
## percentage.expenditure                                   2.799 0.005184 ** 
## Hepatitis.B                                             -2.830 0.004713 ** 
## Measles                                                 -0.058 0.953909    
## BMI                                                     10.659  < 2e-16 ***
## Polio                                                    3.941 8.42e-05 ***
## Total.expenditure                                        0.373 0.708853    
## Diphtheria                                               3.848 0.000124 ***
## HIV.AIDS                                                -8.868  < 2e-16 ***
## GDP                                                      2.077 0.037911 *  
## Population                                               1.599 0.109910    
## thinness..1.19.years                                    -4.099 4.34e-05 ***
## thinness.5.9.years                                       2.364 0.018174 *  
## Income.composition.of.resources                          4.267 2.09e-05 ***
## Schooling                                                1.115 0.264855    
## Adult.Mortality:HIV.AIDS                                11.258  < 2e-16 ***
## BMI:Diphtheria                                          -5.428 6.50e-08 ***
## thinness..1.19.years:thinness.5.9.years                  5.640 1.98e-08 ***
## percentage.expenditure:Income.composition.of.resources  -3.481 0.000513 ***
## Income.composition.of.resources:Schooling                5.200 2.23e-07 ***
## BMI:Schooling                                           -8.874  < 2e-16 ***
## Adult.Mortality:Schooling                               -2.879 0.004042 ** 
## HIV.AIDS:Income.composition.of.resources                 1.824 0.068350 .  
## Total.expenditure:thinness..1.19.years                  -5.527 3.74e-08 ***
## Alcohol:thinness.5.9.years                              -4.202 2.78e-05 ***
## StatusDeveloping:Schooling                               6.075 1.52e-09 ***
## StatusDeveloping:Income.composition.of.resources        -6.234 5.67e-10 ***
## StatusDeveloping:Adult.Mortality                        -5.062 4.58e-07 ***
## Measles:HIV.AIDS                                        -3.191 0.001443 ** 
## Alcohol:HIV.AIDS                                         5.793 8.17e-09 ***
## Alcohol:GDP                                             -2.530 0.011501 *  
## Adult.Mortality:Diphtheria                              -2.821 0.004844 ** 
## Adult.Mortality:Alcohol                                 -3.546 0.000402 ***
## Polio:Schooling                                         -3.431 0.000614 ***
## Diphtheria:Income.composition.of.resources               4.630 3.92e-06 ***
## StatusDeveloping:BMI                                    -3.049 0.002328 ** 
## Hepatitis.B:Total.expenditure                            3.854 0.000120 ***
## Total.expenditure:HIV.AIDS                               5.594 2.57e-08 ***
## Adult.Mortality:Total.expenditure                       -3.044 0.002371 ** 
## percentage.expenditure:HIV.AIDS                         -3.106 0.001928 ** 
## Alcohol:Total.expenditure                               -3.705 0.000218 ***
## percentage.expenditure:thinness.5.9.years                3.974 7.35e-05 ***
## StatusDeveloping:Alcohol                                 2.345 0.019129 *  
## StatusDeveloping:thinness..1.19.years                    3.529 0.000427 ***
## Alcohol:Schooling                                        3.119 0.001847 ** 
## StatusDeveloping:GDP                                     3.160 0.001607 ** 
## percentage.expenditure:GDP                               3.507 0.000464 ***
## Diphtheria:Schooling                                    -1.701 0.089148 .  
## HIV.AIDS:Schooling                                      -2.564 0.010426 *  
## HIV.AIDS:thinness..1.19.years                            1.718 0.085895 .  
## Measles:BMI                                             -2.758 0.005884 ** 
## StatusDeveloping:infant.deaths                          -2.188 0.028814 *  
## Measles:Polio                                            3.354 0.000812 ***
## infant.deaths:Total.expenditure                         -2.897 0.003808 ** 
## Total.expenditure:Population                             3.785 0.000159 ***
## Alcohol:Population                                      -2.540 0.011178 *  
## Total.expenditure:Diphtheria                            -2.666 0.007742 ** 
## Alcohol:Polio                                            2.284 0.022478 *  
## Hepatitis.B:thinness..1.19.years                         2.894 0.003853 ** 
## Alcohol:Income.composition.of.resources                 -2.601 0.009381 ** 
## Diphtheria:thinness.5.9.years                           -2.315 0.020718 *  
## Hepatitis.B:thinness.5.9.years                          -1.479 0.139424    
## BMI:Polio                                                1.764 0.077966 .  
## infant.deaths:Hepatitis.B                               -2.544 0.011052 *  
## infant.deaths:Population                                 0.754 0.451133    
## Total.expenditure:Income.composition.of.resources        1.897 0.057982 .  
## BMI:GDP                                                  1.540 0.123644    
## StatusDeveloping:Total.expenditure                       1.857 0.063445 .  
## Hepatitis.B:Diphtheria                                   2.153 0.031420 *  
## Measles:Diphtheria                                      -2.266 0.023594 *  
## infant.deaths:Diphtheria                                 2.289 0.022201 *  
## infant.deaths:Polio                                     -3.129 0.001785 ** 
## infant.deaths:Income.composition.of.resources            2.921 0.003537 ** 
## Population:Income.composition.of.resources              -2.914 0.003611 ** 
## percentage.expenditure:Population                        2.287 0.022291 *  
## Hepatitis.B:Population                                   2.142 0.032361 *  
## Polio:GDP                                               -2.473 0.013500 *  
## Hepatitis.B:Polio                                       -1.679 0.093402 .  
## Total.expenditure:GDP                                    1.693 0.090558 .  
## Adult.Mortality:percentage.expenditure                   2.581 0.009935 ** 
## Adult.Mortality:GDP                                     -2.078 0.037856 *  
## StatusDeveloping:Diphtheria                             -1.631 0.103096    
## Population:thinness.5.9.years                           -1.450 0.147176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.14 on 1754 degrees of freedom
## Multiple R-squared:  0.903,  Adjusted R-squared:  0.8982 
## F-statistic: 189.8 on 86 and 1754 DF,  p-value: < 2.2e-16

As we suspected, we have a huge (and impossible to interpret) model, given all possible interactions. Nevertheless, our R-squares is at 0.9021, that is a great sign that shows us that we are accounting for more than 90% of the total variability in the dataset with this model. Let´s see how it predicts.

test_results$forwint <- predict(forwint, testing)
postResample(pred = test_results$forwint,  obs = test_results$Life.expectancy)
##      RMSE  Rsquared       MAE 
## 3.4617900 0.8734371 2.6182675
qplot(test_results$forwint, test_results$Life.expectancy) + 
  labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
  geom_abline(intercept = 0, slope = 1, colour = "blue") +
  theme_bw()

bothint <- step(fit.lm, scope = . ~ .^2, direction = 'both')
summary(bothint)
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + Polio + Total.expenditure + Diphtheria + HIV.AIDS + 
##     GDP + Population + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS + 
##     BMI:Diphtheria + thinness..1.19.years:thinness.5.9.years + 
##     percentage.expenditure:Income.composition.of.resources + 
##     Income.composition.of.resources:Schooling + BMI:Schooling + 
##     Adult.Mortality:Schooling + HIV.AIDS:Income.composition.of.resources + 
##     Total.expenditure:thinness..1.19.years + Alcohol:thinness.5.9.years + 
##     Status:Schooling + Status:Income.composition.of.resources + 
##     Status:Adult.Mortality + Measles:HIV.AIDS + Alcohol:HIV.AIDS + 
##     Alcohol:GDP + Adult.Mortality:Diphtheria + Adult.Mortality:Alcohol + 
##     Polio:Schooling + Diphtheria:Income.composition.of.resources + 
##     Status:BMI + Hepatitis.B:Total.expenditure + Total.expenditure:HIV.AIDS + 
##     Adult.Mortality:Total.expenditure + percentage.expenditure:HIV.AIDS + 
##     Alcohol:Total.expenditure + percentage.expenditure:thinness.5.9.years + 
##     Status:Alcohol + Status:thinness..1.19.years + Alcohol:Schooling + 
##     Status:GDP + percentage.expenditure:GDP + HIV.AIDS:Schooling + 
##     HIV.AIDS:thinness..1.19.years + Measles:BMI + Status:infant.deaths + 
##     Measles:Polio + infant.deaths:Total.expenditure + Total.expenditure:Population + 
##     Alcohol:Population + Total.expenditure:Diphtheria + Alcohol:Polio + 
##     Hepatitis.B:thinness..1.19.years + Alcohol:Income.composition.of.resources + 
##     Diphtheria:thinness.5.9.years + BMI:Polio + infant.deaths:Hepatitis.B + 
##     Total.expenditure:Income.composition.of.resources + Status:Total.expenditure + 
##     Hepatitis.B:Diphtheria + Measles:Diphtheria + infant.deaths:Diphtheria + 
##     infant.deaths:Polio + infant.deaths:Income.composition.of.resources + 
##     Population:Income.composition.of.resources + percentage.expenditure:Population + 
##     Hepatitis.B:Population + Polio:GDP + Hepatitis.B:Polio + 
##     Total.expenditure:GDP + Adult.Mortality:percentage.expenditure + 
##     Adult.Mortality:GDP + Status:Polio + Alcohol:BMI + Hepatitis.B:thinness.5.9.years + 
##     Population:thinness.5.9.years + Population:thinness..1.19.years + 
##     infant.deaths:thinness..1.19.years, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5270  -1.6030  -0.0245   1.6431  11.0626 
## 
## Coefficients:
##                                                          Estimate Std. Error
## (Intercept)                                             2.306e+01  6.401e+00
## StatusDeveloping                                        1.848e+01  5.939e+00
## Adult.Mortality                                         3.503e-02  6.753e-03
## infant.deaths                                           3.775e-01  1.911e-01
## Alcohol                                                -1.421e-01  2.552e-01
## percentage.expenditure                                  3.512e-03  1.317e-03
## Hepatitis.B                                            -3.905e-02  1.350e-02
## Measles                                                -2.305e-06  4.155e-05
## BMI                                                     3.975e-01  3.625e-02
## Polio                                                   1.290e-01  3.284e-02
## Total.expenditure                                       7.095e-02  2.224e-01
## Diphtheria                                              7.352e-02  2.199e-02
## HIV.AIDS                                               -1.297e+00  1.475e-01
## GDP                                                     2.036e-04  8.045e-05
## Population                                              2.271e-08  2.168e-08
## thinness..1.19.years                                   -1.637e+00  3.936e-01
## thinness.5.9.years                                      3.395e-01  1.377e-01
## Income.composition.of.resources                         3.462e+01  7.748e+00
## Schooling                                               3.458e-01  3.029e-01
## Adult.Mortality:HIV.AIDS                                7.991e-04  7.060e-05
## BMI:Diphtheria                                         -2.023e-03  3.106e-04
## thinness..1.19.years:thinness.5.9.years                 2.179e-02  3.744e-03
## percentage.expenditure:Income.composition.of.resources -4.737e-03  1.423e-03
## Income.composition.of.resources:Schooling               8.077e-01  1.590e-01
## BMI:Schooling                                          -1.745e-02  1.961e-03
## Adult.Mortality:Schooling                              -1.130e-03  3.549e-04
## HIV.AIDS:Income.composition.of.resources                6.690e-01  3.884e-01
## Total.expenditure:thinness..1.19.years                 -5.954e-02  1.067e-02
## Alcohol:thinness.5.9.years                             -3.809e-02  1.133e-02
## StatusDeveloping:Schooling                              1.314e+00  2.252e-01
## StatusDeveloping:Income.composition.of.resources       -4.351e+01  7.002e+00
## StatusDeveloping:Adult.Mortality                       -2.593e-02  5.227e-03
## Measles:HIV.AIDS                                       -9.133e-06  2.844e-06
## Alcohol:HIV.AIDS                                        6.044e-02  1.041e-02
## Alcohol:GDP                                            -7.187e-06  2.761e-06
## Adult.Mortality:Diphtheria                             -7.825e-05  3.020e-05
## Adult.Mortality:Alcohol                                -9.920e-04  2.845e-04
## Polio:Schooling                                        -8.506e-03  1.529e-03
## Diphtheria:Income.composition.of.resources              9.609e-02  2.021e-02
## StatusDeveloping:BMI                                   -4.553e-02  1.627e-02
## Hepatitis.B:Total.expenditure                           4.709e-03  1.224e-03
## Total.expenditure:HIV.AIDS                              6.273e-02  1.128e-02
## Adult.Mortality:Total.expenditure                      -1.050e-03  3.417e-04
## percentage.expenditure:HIV.AIDS                        -2.861e-04  8.863e-05
## Alcohol:Total.expenditure                              -4.257e-02  1.159e-02
## percentage.expenditure:thinness.5.9.years               2.703e-04  6.965e-05
## StatusDeveloping:Alcohol                                1.782e-01  8.684e-02
## StatusDeveloping:thinness..1.19.years                   1.306e+00  3.729e-01
## Alcohol:Schooling                                       5.379e-02  1.882e-02
## StatusDeveloping:GDP                                    7.551e-05  2.258e-05
## percentage.expenditure:GDP                              5.726e-09  1.703e-09
## HIV.AIDS:Schooling                                     -4.268e-02  1.701e-02
## HIV.AIDS:thinness..1.19.years                           6.924e-03  4.360e-03
## Measles:BMI                                            -3.279e-06  1.198e-06
## StatusDeveloping:infant.deaths                         -3.739e-01  1.911e-01
## Measles:Polio                                           4.130e-06  1.188e-06
## infant.deaths:Total.expenditure                        -3.035e-03  1.246e-03
## Total.expenditure:Population                            6.723e-09  1.651e-09
## Alcohol:Population                                     -2.766e-09  9.832e-10
## Total.expenditure:Diphtheria                           -4.803e-03  1.821e-03
## Alcohol:Polio                                           2.977e-03  1.381e-03
## Hepatitis.B:thinness..1.19.years                        4.924e-03  1.708e-03
## Alcohol:Income.composition.of.resources                -8.987e-01  3.345e-01
## Diphtheria:thinness.5.9.years                          -2.931e-03  1.145e-03
## BMI:Polio                                               5.953e-04  2.561e-04
## infant.deaths:Hepatitis.B                              -1.085e-04  3.748e-05
## Total.expenditure:Income.composition.of.resources       4.263e-01  2.230e-01
## StatusDeveloping:Total.expenditure                      2.143e-01  1.132e-01
## Hepatitis.B:Diphtheria                                  3.186e-04  1.375e-04
## Measles:Diphtheria                                     -2.662e-06  1.089e-06
## infant.deaths:Diphtheria                                1.288e-04  5.247e-05
## infant.deaths:Polio                                    -2.228e-04  7.456e-05
## infant.deaths:Income.composition.of.resources           3.450e-02  1.216e-02
## Population:Income.composition.of.resources             -8.581e-08  3.626e-08
## percentage.expenditure:Population                       6.886e-12  2.943e-12
## Hepatitis.B:Population                                  1.390e-10  5.800e-11
## Polio:GDP                                              -1.796e-06  7.294e-07
## Hepatitis.B:Polio                                      -2.496e-04  1.451e-04
## Total.expenditure:GDP                                   4.335e-06  2.566e-06
## Adult.Mortality:percentage.expenditure                  6.013e-06  2.333e-06
## Adult.Mortality:GDP                                    -6.794e-07  3.142e-07
## StatusDeveloping:Polio                                 -3.952e-02  2.525e-02
## Alcohol:BMI                                             2.623e-03  1.599e-03
## Hepatitis.B:thinness.5.9.years                         -2.394e-03  1.643e-03
## Population:thinness.5.9.years                          -1.358e-09  7.128e-10
## Population:thinness..1.19.years                         1.192e-09  7.439e-10
## infant.deaths:thinness..1.19.years                     -1.855e-04  1.331e-04
##                                                        t value Pr(>|t|)    
## (Intercept)                                              3.602 0.000325 ***
## StatusDeveloping                                         3.111 0.001893 ** 
## Adult.Mortality                                          5.188 2.38e-07 ***
## infant.deaths                                            1.976 0.048341 *  
## Alcohol                                                 -0.557 0.577669    
## percentage.expenditure                                   2.668 0.007701 ** 
## Hepatitis.B                                             -2.892 0.003873 ** 
## Measles                                                 -0.055 0.955767    
## BMI                                                     10.966  < 2e-16 ***
## Polio                                                    3.928 8.91e-05 ***
## Total.expenditure                                        0.319 0.749759    
## Diphtheria                                               3.343 0.000846 ***
## HIV.AIDS                                                -8.797  < 2e-16 ***
## GDP                                                      2.531 0.011459 *  
## Population                                               1.047 0.295194    
## thinness..1.19.years                                    -4.159 3.35e-05 ***
## thinness.5.9.years                                       2.465 0.013807 *  
## Income.composition.of.resources                          4.468 8.42e-06 ***
## Schooling                                                1.142 0.253712    
## Adult.Mortality:HIV.AIDS                                11.318  < 2e-16 ***
## BMI:Diphtheria                                          -6.515 9.50e-11 ***
## thinness..1.19.years:thinness.5.9.years                  5.821 6.94e-09 ***
## percentage.expenditure:Income.composition.of.resources  -3.329 0.000889 ***
## Income.composition.of.resources:Schooling                5.081 4.15e-07 ***
## BMI:Schooling                                           -8.897  < 2e-16 ***
## Adult.Mortality:Schooling                               -3.184 0.001477 ** 
## HIV.AIDS:Income.composition.of.resources                 1.723 0.085147 .  
## Total.expenditure:thinness..1.19.years                  -5.580 2.79e-08 ***
## Alcohol:thinness.5.9.years                              -3.362 0.000789 ***
## StatusDeveloping:Schooling                               5.834 6.43e-09 ***
## StatusDeveloping:Income.composition.of.resources        -6.214 6.44e-10 ***
## StatusDeveloping:Adult.Mortality                        -4.961 7.69e-07 ***
## Measles:HIV.AIDS                                        -3.211 0.001345 ** 
## Alcohol:HIV.AIDS                                         5.804 7.69e-09 ***
## Alcohol:GDP                                             -2.603 0.009306 ** 
## Adult.Mortality:Diphtheria                              -2.591 0.009646 ** 
## Adult.Mortality:Alcohol                                 -3.487 0.000500 ***
## Polio:Schooling                                         -5.563 3.06e-08 ***
## Diphtheria:Income.composition.of.resources               4.756 2.14e-06 ***
## StatusDeveloping:BMI                                    -2.799 0.005190 ** 
## Hepatitis.B:Total.expenditure                            3.846 0.000124 ***
## Total.expenditure:HIV.AIDS                               5.563 3.06e-08 ***
## Adult.Mortality:Total.expenditure                       -3.073 0.002151 ** 
## percentage.expenditure:HIV.AIDS                         -3.228 0.001272 ** 
## Alcohol:Total.expenditure                               -3.673 0.000247 ***
## percentage.expenditure:thinness.5.9.years                3.881 0.000108 ***
## StatusDeveloping:Alcohol                                 2.052 0.040336 *  
## StatusDeveloping:thinness..1.19.years                    3.502 0.000474 ***
## Alcohol:Schooling                                        2.858 0.004310 ** 
## StatusDeveloping:GDP                                     3.345 0.000841 ***
## percentage.expenditure:GDP                               3.362 0.000792 ***
## HIV.AIDS:Schooling                                      -2.509 0.012206 *  
## HIV.AIDS:thinness..1.19.years                            1.588 0.112485    
## Measles:BMI                                             -2.737 0.006261 ** 
## StatusDeveloping:infant.deaths                          -1.956 0.050566 .  
## Measles:Polio                                            3.477 0.000519 ***
## infant.deaths:Total.expenditure                         -2.436 0.014931 *  
## Total.expenditure:Population                             4.073 4.85e-05 ***
## Alcohol:Population                                      -2.814 0.004950 ** 
## Total.expenditure:Diphtheria                            -2.637 0.008432 ** 
## Alcohol:Polio                                            2.156 0.031216 *  
## Hepatitis.B:thinness..1.19.years                         2.883 0.003991 ** 
## Alcohol:Income.composition.of.resources                 -2.687 0.007275 ** 
## Diphtheria:thinness.5.9.years                           -2.559 0.010591 *  
## BMI:Polio                                                2.324 0.020241 *  
## infant.deaths:Hepatitis.B                               -2.896 0.003827 ** 
## Total.expenditure:Income.composition.of.resources        1.911 0.056119 .  
## StatusDeveloping:Total.expenditure                       1.894 0.058450 .  
## Hepatitis.B:Diphtheria                                   2.316 0.020658 *  
## Measles:Diphtheria                                      -2.444 0.014627 *  
## infant.deaths:Diphtheria                                 2.456 0.014164 *  
## infant.deaths:Polio                                     -2.989 0.002842 ** 
## infant.deaths:Income.composition.of.resources            2.838 0.004595 ** 
## Population:Income.composition.of.resources              -2.366 0.018072 *  
## percentage.expenditure:Population                        2.340 0.019408 *  
## Hepatitis.B:Population                                   2.397 0.016625 *  
## Polio:GDP                                               -2.462 0.013910 *  
## Hepatitis.B:Polio                                       -1.719 0.085709 .  
## Total.expenditure:GDP                                    1.689 0.091356 .  
## Adult.Mortality:percentage.expenditure                   2.577 0.010052 *  
## Adult.Mortality:GDP                                     -2.162 0.030754 *  
## StatusDeveloping:Polio                                  -1.565 0.117756    
## Alcohol:BMI                                              1.641 0.101080    
## Hepatitis.B:thinness.5.9.years                          -1.457 0.145220    
## Population:thinness.5.9.years                           -1.906 0.056829 .  
## Population:thinness..1.19.years                          1.603 0.109205    
## infant.deaths:thinness..1.19.years                      -1.394 0.163568    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.139 on 1754 degrees of freedom
## Multiple R-squared:  0.9031, Adjusted R-squared:  0.8983 
## F-statistic:   190 on 86 and 1754 DF,  p-value: < 2.2e-16
test_results$bothint <- predict(bothint, testing)
postResample(pred = test_results$bothint,  obs = test_results$Life.expectancy)
##     RMSE Rsquared      MAE 
## 3.486865 0.871720 2.631852

This is the best model until now both for predicting and for accounting for the variability of the data.

I will define the model of bothsides with interaction, to use in Caret later for a better comparison.

Model1 <- Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
    Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
    BMI + Polio + Total.expenditure + Diphtheria + HIV.AIDS + 
    GDP + Population + thinness..1.19.years + thinness.5.9.years + 
    Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS + 
    BMI:Diphtheria + thinness..1.19.years:thinness.5.9.years + 
    percentage.expenditure:Income.composition.of.resources + 
    Adult.Mortality:Schooling + BMI:Schooling + Income.composition.of.resources:Schooling + 
    HIV.AIDS:Income.composition.of.resources + Alcohol:thinness..1.19.years + 
    Status:Schooling + Status:Income.composition.of.resources + 
    Total.expenditure:thinness..1.19.years + Status:Adult.Mortality + 
    Measles:HIV.AIDS + Polio:Schooling + Diphtheria:Income.composition.of.resources + Alcohol:HIV.AIDS + Status:GDP + HIV.AIDS:GDP + percentage.expenditure:thinness..1.19.years + Status:BMI + Total.expenditure:HIV.AIDS + HIV.AIDS:Schooling + 
    Adult.Mortality:Hepatitis.B + Measles:Total.expenditure + 
    Status:Alcohol + infant.deaths:BMI + Alcohol:Schooling + 
    Status:thinness..1.19.years + Status:Total.expenditure + 
    Hepatitis.B:Measles + infant.deaths:Total.expenditure + Total.expenditure:Population + Alcohol:Population + Status:infant.deaths + Adult.Mortality:Alcohol + 
    percentage.expenditure:GDP + BMI:GDP + Alcohol:percentage.expenditure + 
    infant.deaths:thinness.5.9.years + Hepatitis.B:Diphtheria + 
    Polio:Income.composition.of.resources + BMI:Polio + Alcohol:Polio + 
    percentage.expenditure:Population + Hepatitis.B:HIV.AIDS + 
    Diphtheria:HIV.AIDS + Alcohol:Total.expenditure + Total.expenditure:Income.composition.of.resources + 
    Alcohol:Income.composition.of.resources + infant.deaths:Income.composition.of.resources + 
    Measles:Income.composition.of.resources + Polio:Diphtheria + 
    infant.deaths:GDP + Status:Population + Hepatitis.B:thinness..1.19.years + 
    Hepatitis.B:thinness.5.9.years + Diphtheria:thinness..1.19.years + 
    infant.deaths:Polio + Diphtheria:GDP + Status:Diphtheria + 
    Measles:Polio

Let´s plot Observed vs. Predicted values

qplot(test_results$bothint, test_results$Life.expectancy) + 
  labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
  geom_abline(intercept = 0, slope = 1, colour = "blue") +
  theme_bw()

Now, let´s see what are the most important variables, and the most important interactions according to this proceadure.

modboth <- train(Model1, data = training, 
                    method = "lm", 
                    preProc=c('scale', 'center'),
                    trControl = control)
importance <- varImp(modboth, scale=FALSE)
plot(importance,top=15)

As we can see, BMI seems to be now the most important variable and the most important interactions are “Adult Mortality:HIV”, “BMI:Schooling”. Until now, this is our best model for predicting life expectancy and accounting for the data variability. Nevertheless, it is a huge and impossible model to interpret.

Let´s try also the same model but now through a general linear model, using “Gamma” family.

modboth2 <- train(Model1, data = training, 
                    method = "glm", family="Gamma",
                    preProc=c('scale', 'center'),
                    trControl = control)
test_results$modboth2 <- predict(modboth2, testing)
postResample(pred = test_results$modboth2,  obs = test_results$Life.expectancy)
##      RMSE  Rsquared       MAE 
## 3.2187054 0.8892586 2.4491829
importance <- varImp(modboth2, scale=FALSE)
plot(importance,top=15)

Now, the model has not got better. Although is fairly good, in comparison with the rest, it does not improve the one with the same variables and interaction, but assuming normality.

Forward Regression

Let´s try with other options. First let´s do forward regression with Caret.

for_tune <- train(Life.expectancy ~ ., data = training, 
                  method = "leapForward", 
                  preProc=c('scale', 'center'),
                  tuneGrid = expand.grid(nvmax = 4:19),
                  trControl = control)
for_tune
## Linear Regression with Forward Selection 
## 
## 1841 samples
##   18 predictor
## 
## Pre-processing: scaled (18), centered (18) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1658, 1657, 1657, 1656, 1657, 1656, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    4     4.317675  0.8095993  3.159062
##    5     4.268483  0.8142712  3.148148
##    6     4.175195  0.8219003  3.092526
##    7     4.131657  0.8256070  3.069960
##    8     4.121136  0.8265466  3.060980
##    9     4.099382  0.8285044  3.056625
##   10     4.074994  0.8305700  3.051912
##   11     4.057618  0.8319578  3.041665
##   12     4.067163  0.8311810  3.050688
##   13     4.071590  0.8308270  3.055188
##   14     4.067128  0.8312313  3.053511
##   15     4.065701  0.8313714  3.054043
##   16     4.065881  0.8313733  3.055358
##   17     4.065712  0.8313851  3.055441
##   18     4.065680  0.8313897  3.055641
##   19     4.065680  0.8313897  3.055641
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 11.
plot(for_tune)

Now, let´s inspect which variables are selected and how well does this model perform:

coef(for_tune$finalModel, for_tune$bestTune$nvmax)
##                     (Intercept)                StatusDeveloping 
##                      68.6932102                      -0.5644930 
##                 Adult.Mortality                         Alcohol 
##                      -1.9452031                      -0.7045744 
##                             BMI                           Polio 
##                       0.9964461                       0.4831664 
##                      Diphtheria                        HIV.AIDS 
##                       0.6505967                      -2.7200491 
##                             GDP            thinness..1.19.years 
##                       0.6455364                      -0.3768232 
## Income.composition.of.resources                       Schooling 
##                       2.1330433                       2.8661297
test_results$frw <- predict(for_tune, testing)
postResample(pred = test_results$frw,  obs = test_results$Life.expectancy)
##     RMSE Rsquared      MAE 
## 3.924330 0.834699 2.982841

As it can be seen, this is not a great model, neither for the R-squared nor for the MAE or RMSE. We can see the Observed vs. Predicted values as it follows:

qplot(test_results$frw, test_results$Life.expectancy) + 
  labs(title="Forward Regression Observed VS Predicted", x="Predicted", y="Observed") +
  geom_abline(intercept = 0, slope = 1, colour = "blue") +
  theme_bw()

Backward Regression

Let´s now check with backward regression

back_tune <- train(Life.expectancy ~ ., data = training, 
                   method = "leapBackward", 
                   preProc=c('scale', 'center'),
                   tuneGrid = expand.grid(nvmax = 4:19),
                   trControl = control)
back_tune
## Linear Regression with Backwards Selection 
## 
## 1841 samples
##   18 predictor
## 
## Pre-processing: scaled (18), centered (18) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1657, 1657, 1658, 1658, 1657, 1656, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    4     4.319974  0.8087610  3.158816
##    5     4.249381  0.8149485  3.127064
##    6     4.155671  0.8232294  3.073276
##    7     4.107750  0.8271224  3.050102
##    8     4.096536  0.8280898  3.042170
##    9     4.075513  0.8297994  3.048187
##   10     4.068795  0.8301657  3.043452
##   11     4.056296  0.8311632  3.039057
##   12     4.077251  0.8293977  3.059022
##   13     4.077811  0.8293867  3.060592
##   14     4.073521  0.8297927  3.058771
##   15     4.072141  0.8299535  3.058664
##   16     4.074264  0.8297790  3.058976
##   17     4.074799  0.8297339  3.059962
##   18     4.074977  0.8297198  3.060346
##   19     4.074977  0.8297198  3.060346
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 11.
plot(back_tune)

Let´s see what variables have been selected and how the model performs:

coef(back_tune$finalModel, back_tune$bestTune$nvmax)
##                     (Intercept)                StatusDeveloping 
##                      68.6932102                      -0.5644930 
##                 Adult.Mortality                         Alcohol 
##                      -1.9452031                      -0.7045744 
##                             BMI                           Polio 
##                       0.9964461                       0.4831664 
##                      Diphtheria                        HIV.AIDS 
##                       0.6505967                      -2.7200491 
##                             GDP            thinness..1.19.years 
##                       0.6455364                      -0.3768232 
## Income.composition.of.resources                       Schooling 
##                       2.1330433                       2.8661297
test_results$bw <- predict(back_tune, testing)
postResample(pred = test_results$bw,  obs = test_results$Life.expectancy)
##     RMSE Rsquared      MAE 
## 3.924330 0.834699 2.982841

Here, we get pretty much the same result as with the forward regression. We can see it in the observed vs. predicted values plot (the similarities with the one before:

qplot(test_results$bw, test_results$Life.expectancy) + 
  labs(title="Backward Regression Observed VS Predicted", x="Predicted", y="Observed")  +
  geom_abline(intercept = 0, slope = 1, colour = "blue") +
  theme_bw()

Stepwise regression

Let´s now check stepwise regression:

step_tune <- train(Life.expectancy ~ ., data = training, 
                   method = "leapSeq", 
                   preProc=c('scale', 'center'),
                   tuneGrid = expand.grid(nvmax = 4:19),
                   trControl = control)
plot(step_tune)

Let´s see which variables are selected and how the model performs:

coef(step_tune$finalModel, step_tune$bestTune$nvmax)
##                     (Intercept)                StatusDeveloping 
##                     68.69321021                     -0.52859026 
##                 Adult.Mortality                   infant.deaths 
##                     -1.95757329                      0.03839523 
##                         Alcohol          percentage.expenditure 
##                     -0.68906856                      0.23459228 
##                     Hepatitis.B                         Measles 
##                     -0.19536435                     -0.17781085 
##                             BMI                           Polio 
##                      0.98664046                      0.49293815 
##               Total.expenditure                      Diphtheria 
##                      0.15088380                      0.72924319 
##                        HIV.AIDS                             GDP 
##                     -2.73748007                      0.36769269 
##                      Population            thinness..1.19.years 
##                      0.01786592                     -0.46828747 
##              thinness.5.9.years Income.composition.of.resources 
##                      0.13295805                      2.14321792 
##                       Schooling 
##                      2.85474312
test_results$seq <- predict(step_tune, testing)
postResample(pred = test_results$seq,  obs = test_results$Life.expectancy)
##      RMSE  Rsquared       MAE 
## 3.9043953 0.8364015 2.9682015

As it is seen here, we still get the same result (or almost the same one) in comparison with backward and forward regression. We can see it (again) in the plot of observed vs. predicted values in testing set:

qplot(test_results$seq, test_results$Life.expectancy) + 
  labs(title="Stepwise Regression Observed VS Predicted", x="Predicted", y="Observed") +
  geom_abline(intercept = 0, slope = 1, colour = "blue") +
  theme_bw()

Regularization Method

Ridge Regression

Let´s try now with our first regularizatino method: Ridge regression. Now, a squared magnitude of the coefficient is added as the penalty term to the loss function. This is good to prevent overfitting. This regression has a parameter that has to be tunned: lambda. If it is zero, then the equation will be the same as multiple linear regression. Instead, if the parameter is large, it will lead to under-fitting.

The bad thing about this regression is that is not useful for getting a parsimonious model, as it tends to shrink coefficient to near zero. So we can not do feature selection with it.

Let´s run this regression, and see how it performs in general.

ridge_grid <- expand.grid(lambda = seq(0, .1, length = 20))
ridge_tune <- train(Life.expectancy ~ ., data = training,
                    method='ridge',
                    preProc=c('scale','center'),
                    tuneGrid = ridge_grid,
                    trControl=control)
plot(ridge_tune)

ridge_tune$bestTune
##        lambda
## 2 0.005263158
test_results$ridge <- predict(ridge_tune, testing)
postResample(pred = test_results$ridge,  obs = test_results$Life.expectancy)
##     RMSE Rsquared      MAE 
## 3.901995 0.836567 2.964577

As we can see, our lambda parameter is pretty small and this may explain why we get similar performance in comparison to the multiple linear regression.

Lasso

Lasso (least absolute shrinkage and selection operator) performs also regularization. This time, the absolute value of the magnitude of the coefficient is added as the penalty term to the loss funcion. As some variable coefficient can be zero, we can perform feature selection with it.

Here, we also have a tunning parameter: lambda. Let´s see this model performance:

lasso_grid <- expand.grid(fraction = seq(.01, 1, length = 20))
lasso_tune <- train(Life.expectancy ~ . , data = training,
                    method='lasso',
                    preProc=c('scale','center'),
                    tuneGrid = lasso_grid,
                    trControl=control)
plot(lasso_tune)

lasso_tune$bestTune
##     fraction
## 19 0.9478947
test_results$lasso <- predict(lasso_tune, testing)
postResample(pred = test_results$lasso,  obs = test_results$Life.expectancy)
##      RMSE  Rsquared       MAE 
## 3.9117949 0.8358292 2.9688481

Again, we get very similar results as with Rdige regression and multiple linear regression. Let´s check the variable importante to get some interpretation of what is happening.

importance <- varImp(lasso_tune, scale=FALSE)
plot(importance)

Elastic Net

Elastic net regression that combines, in a linear way, lasso regression and ridge regression (their penalties).

Let´s see if it performs better, although is highly probable it will provide us with a very similar result of the other two.

modelLookup('glmnet')
##    model parameter                    label forReg forClass probModel
## 1 glmnet     alpha        Mixing Percentage   TRUE     TRUE      TRUE
## 2 glmnet    lambda Regularization Parameter   TRUE     TRUE      TRUE
elastic_grid = expand.grid(alpha = seq(0, .2, 0.01), lambda = seq(0, .1, 0.01))
glmnet_tune <- train(Life.expectancy ~. , data = training,
                     method='glmnet',
                     preProc=c('scale','center'),
                     tuneGrid = elastic_grid,
                     trControl=control)
plot(glmnet_tune)

glmnet_tune$bestTune
##     alpha lambda
## 154  0.13    0.1
test_results$glmnet <- predict(glmnet_tune, testing)
postResample(pred = test_results$glmnet,  obs = test_results$Life.expectancy)
##      RMSE  Rsquared       MAE 
## 3.9036419 0.8365401 2.9660770

We get a slightly lower MAE and the same R-squared value as we got in the previous ones.

We can check the value of each coefficient:

coef(glmnet_tune$finalModel, glmnet_tune$bestTune$lambda)
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                                            1
## (Intercept)                     68.693210212
## StatusDeveloping                -0.507931034
## Adult.Mortality                 -1.960637780
## infant.deaths                    0.002951869
## Alcohol                         -0.613526866
## percentage.expenditure           0.230630014
## Hepatitis.B                     -0.167257030
## Measles                         -0.153379641
## BMI                              0.983151867
## Polio                            0.488482747
## Total.expenditure                0.138442019
## Diphtheria                       0.713065025
## HIV.AIDS                        -2.713100223
## GDP                              0.373502218
## Population                       0.015669246
## thinness..1.19.years            -0.329271778
## thinness.5.9.years               .          
## Income.composition.of.resources  2.152015642
## Schooling                        2.788738105

Let´s see what variables are the most important according to this method:

importance <- varImp(glmnet_tune, scale=FALSE)
plot(importance)

coef(glmnet_tune$finalModel, glmnet_tune$bestTune$lambda)[,1]
##                     (Intercept)                StatusDeveloping 
##                    68.693210212                    -0.507931034 
##                 Adult.Mortality                   infant.deaths 
##                    -1.960637780                     0.002951869 
##                         Alcohol          percentage.expenditure 
##                    -0.613526866                     0.230630014 
##                     Hepatitis.B                         Measles 
##                    -0.167257030                    -0.153379641 
##                             BMI                           Polio 
##                     0.983151867                     0.488482747 
##               Total.expenditure                      Diphtheria 
##                     0.138442019                     0.713065025 
##                        HIV.AIDS                             GDP 
##                    -2.713100223                     0.373502218 
##                      Population            thinness..1.19.years 
##                     0.015669246                    -0.329271778 
##              thinness.5.9.years Income.composition.of.resources 
##                     0.000000000                     2.152015642 
##                       Schooling 
##                     2.788738105

Principal Component Regression

Let´s do another kind of regression now: principal component regression. Let´s see how it performs:

pcr_tune <- train(Life.expectancy ~ . , data = training,
                  method='pcr',
                  preProc=c('scale','center'),
                  tuneGrid = expand.grid(ncomp = 2:8),
                  trControl=control)
plot(pcr_tune)

test_results$pcr <- predict(pcr_tune, testing)
postResample(pred = test_results$pcr,  obs = test_results$Life.expectancy)
##      RMSE  Rsquared       MAE 
## 3.9978858 0.8284269 3.0785942

Here we get a worse result even than before. Our R-squared has gone down to 0.8290 and our MAE is up to 3.06, so this model is worse both for accounting variability of the dataset and also for predicting.

Partial Least Square Regression (PLS)

Now we will do something similar as PCR, but we will find a linear regression by projecting the predicted values and the observable variables to a new space. Let´s see what happens when we run it:

pls_tune <- train(Life.expectancy ~ . , data = training,
                  method='pls',
                  preProc=c('scale','center'),
                  tuneGrid = expand.grid(ncomp = 2:8),
                  trControl=control)
plot(pls_tune)

test_results$pls <- predict(pls_tune, testing)
postResample(pred = test_results$pls,  obs = test_results$Life.expectancy)
##      RMSE  Rsquared       MAE 
## 3.9059808 0.8362743 2.9680634

As we can see, our results are not that bad now. We get a R-square measure 0.8386 and a MAE of 2.927 (and a RMSE of 3.86).

Some different approaches

Now, we can try and see with different approaches the most important variables in our dataset. First let´s do Recursive Feature Elimination.

Recursive Feature Elimination

Here, we can use a different function in order to select a different algorithm. I will use: random forest and bagging of trees.

First with Random Forest:

modelRFE <- rfe(Life.expectancy ~ ., data=training, sizes=c(1:19), rfeControl=rfeControl(functions=rfFuncs, method="cv", number=5))
print(modelRFE)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (5 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables  RMSE Rsquared   MAE RMSESD RsquaredSD   MAESD Selected
##          1 5.801   0.6525 4.492 0.3881   0.046921 0.31979         
##          2 3.308   0.8884 2.486 0.1542   0.012040 0.12929         
##          3 2.351   0.9447 1.666 0.1306   0.006062 0.07039         
##          4 2.335   0.9459 1.627 0.1588   0.007789 0.09882         
##          5 2.286   0.9485 1.547 0.1719   0.008049 0.10709         
##          6 1.988   0.9596 1.277 0.1169   0.005045 0.05513         
##          7 1.990   0.9597 1.268 0.1085   0.004792 0.05535         
##          8 1.982   0.9602 1.257 0.1647   0.006735 0.06802         
##          9 1.922   0.9623 1.206 0.1564   0.006264 0.07624        *
##         10 1.927   0.9622 1.210 0.1669   0.006691 0.06407         
##         11 1.941   0.9617 1.225 0.1747   0.007079 0.07020         
##         12 1.937   0.9617 1.223 0.1654   0.006555 0.06697         
##         13 1.949   0.9614 1.242 0.1662   0.006682 0.06817         
##         14 1.973   0.9604 1.262 0.1709   0.007002 0.07279         
##         15 1.975   0.9603 1.267 0.1567   0.006374 0.06268         
##         16 1.991   0.9597 1.279 0.1634   0.006742 0.06718         
##         17 2.004   0.9593 1.295 0.1499   0.006322 0.05533         
##         18 1.981   0.9600 1.280 0.1549   0.006369 0.06539         
## 
## The top 5 variables (out of 9):
##    HIV.AIDS, Income.composition.of.resources, Adult.Mortality, Total.expenditure, Alcohol
predictors(modelRFE)
## [1] "HIV.AIDS"                        "Income.composition.of.resources"
## [3] "Adult.Mortality"                 "Total.expenditure"              
## [5] "Alcohol"                         "thinness.5.9.years"             
## [7] "infant.deaths"                   "Schooling"                      
## [9] "BMI"
ggplot(data = modelRFE, metric = "MAE") + theme_bw()

Here the result is interesting. According to this approach, the top 5 variables are “HIV”, “Income Composition of resources”, “Alcohol” and “BMI”. Let´s see what happens when we do the same but with the function calling for bagging.

Second, with treebagFuncs:

modelRFE <- rfe(Life.expectancy ~ ., data=training, sizes=c(1:19), rfeControl=rfeControl(functions=treebagFuncs, method="cv", number=5))
print(modelRFE)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (5 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables  RMSE Rsquared   MAE RMSESD RsquaredSD  MAESD Selected
##          1 4.693   0.7725 3.052 0.5602   0.050012 0.2077         
##          2 3.394   0.8827 2.494 0.2930   0.016414 0.1912         
##          3 3.355   0.8851 2.458 0.2323   0.011891 0.1578         
##          4 3.346   0.8859 2.441 0.2340   0.012106 0.1975         
##          5 3.360   0.8850 2.444 0.2861   0.016235 0.2053         
##          6 3.321   0.8881 2.421 0.2034   0.010617 0.1694         
##          7 3.337   0.8866 2.431 0.2584   0.014468 0.1936         
##          8 3.324   0.8876 2.419 0.2482   0.013797 0.1935         
##          9 3.331   0.8870 2.427 0.2497   0.013510 0.1921         
##         10 3.337   0.8870 2.423 0.2064   0.009882 0.1671         
##         11 3.343   0.8860 2.437 0.2385   0.012892 0.1552         
##         12 3.299   0.8893 2.404 0.2475   0.013014 0.1731         
##         13 3.311   0.8884 2.408 0.2083   0.010718 0.1600         
##         14 3.254   0.8925 2.368 0.2561   0.013633 0.1671        *
##         15 3.320   0.8879 2.405 0.2379   0.013334 0.1844         
##         16 3.307   0.8887 2.408 0.2926   0.017124 0.2173         
##         17 3.317   0.8881 2.419 0.2422   0.013225 0.1919         
##         18 3.310   0.8885 2.411 0.1929   0.009299 0.1398         
## 
## The top 5 variables (out of 14):
##    Adult.Mortality, Income.composition.of.resources, HIV.AIDS, Schooling, BMI
predictors(modelRFE)
##  [1] "Adult.Mortality"                 "Income.composition.of.resources"
##  [3] "HIV.AIDS"                        "Schooling"                      
##  [5] "BMI"                             "thinness.5.9.years"             
##  [7] "infant.deaths"                   "thinness..1.19.years"           
##  [9] "Alcohol"                         "GDP"                            
## [11] "Total.expenditure"               "Measles"                        
## [13] "StatusDeveloping"                "Diphtheria"
ggplot(data = modelRFE, metric = "MAE") + theme_bw()

Now the result is different. According to this, the most important variable is “Adult Mortality”, while “HIV” and “BMI” are also included. It is a similar result, although now the top variable is different.

Random Forest for variable selection

We can use random forest algorithm to meassure the importance of each variable in predicting the output. Let´s see:

fitRF <- train(Life.expectancy ~ .,data=training, method='rf', preProcess="scale", trControl=control)
importance <- varImp(fitRF, scale=FALSE)
plot(importance)

Now the important variables are the same, but in a different order. First we have “income composition of resources” (or HDI), followed by “HIV”, “Adult Mortality” and “Schooling”. In particular, is interesting how the variable “Schooling” is so important to predict the life expectation in a country, as its relationship with the amount of years that someone lives is not straightforward. It is probable that the more educated someone is, the more opportunities of having a better life and thus, live more years.

FOCI

Let´s see what FOCI has to say about the variable importance.

y <- training$Life.expectancy
x <- training[,3:19]
foci(y,x)$selectedVar
##    index                           names
## 1:     1                 Adult.Mortality
## 2:    16 Income.composition.of.resources
## 3:    14            thinness..1.19.years
## 4:     2                   infant.deaths
## 5:    11                        HIV.AIDS
## 6:    15              thinness.5.9.years

Now, through FUCI, we get the same important variables, although now Thinness from 1-9 years seems to be important, as well as infant deaths and thinness from 5-9 years old. Now HIV is not the most important one, nor the second one.

Boruta

Boruta is another algorithm to help us check what are the most important variables:

boruta <- Boruta(Life.expectancy ~ ., data=training, doTrace=0)  
plot(boruta, cex.axis=.7, las=2, xlab="", main="Variable Importance")

According to Boruta, the most important variable is HIV, followed by Adult Mortality, HDI and Alcohol.

Comparing Models

mods <- resamples(list(lm =lm1, glm = lm2,glm2 = modboth2, AICstep =modboth, Elasticnet = glmnet_tune,Lasso = lasso_tune, Ridge=ridge_tune,PCR=pcr_tune,PLS=pls_tune))
summary(mods)
## 
## Call:
## summary.resamples(object = mods)
## 
## Models: lm, glm, glm2, AICstep, Elasticnet, Lasso, Ridge, PCR, PLS 
## Number of resamples: 10 
## 
## MAE 
##                Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## lm         2.904209 2.977280 3.059957 3.052689 3.130283 3.186402    0
## glm        2.541723 2.973178 3.042020 3.010130 3.120245 3.266869    0
## glm2       2.104592 2.308031 2.435948 2.418056 2.554022 2.623127    0
## AICstep    2.057897 2.244927 2.359047 2.403024 2.552085 2.867144    0
## Elasticnet 2.719857 2.872102 3.083121 3.046356 3.144336 3.516041    0
## Lasso      2.682250 2.972440 3.056770 3.043569 3.183607 3.365308    0
## Ridge      2.733071 2.981166 3.007773 3.056833 3.145932 3.570312    0
## PCR        2.903939 3.038067 3.182672 3.198087 3.343363 3.576914    0
## PLS        2.840729 2.880723 3.082646 3.055828 3.201801 3.311333    0
## 
## RMSE 
##                Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## lm         3.700169 3.966548 4.101958 4.070353 4.172824 4.333241    0
## glm        3.197790 3.887228 4.096885 3.985362 4.188647 4.387456    0
## glm2       2.703738 3.149077 3.281939 3.308094 3.582901 3.662857    0
## AICstep    2.740023 3.020834 3.194452 3.278964 3.518878 3.910933    0
## Elasticnet 3.680510 3.852835 4.077948 4.062508 4.167902 4.633414    0
## Lasso      3.401217 3.903813 4.093806 4.070539 4.356432 4.472962    0
## Ridge      3.583234 3.951096 4.073430 4.069262 4.165969 4.752926    0
## PCR        3.730052 4.109398 4.201814 4.250417 4.420046 4.808915    0
## PLS        3.544669 3.830500 4.132513 4.063984 4.269903 4.545798    0
## 
## Rsquared 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lm         0.7862637 0.8241307 0.8376867 0.8302307 0.8412938 0.8527473    0
## glm        0.8108844 0.8246107 0.8311044 0.8378114 0.8451480 0.8828211    0
## glm2       0.8614454 0.8734216 0.8815586 0.8868794 0.9024361 0.9229028    0
## AICstep    0.8598332 0.8810161 0.8922125 0.8897229 0.9002519 0.9171064    0
## Elasticnet 0.7868369 0.8063373 0.8293767 0.8298865 0.8525653 0.8700561    0
## Lasso      0.8055561 0.8155135 0.8282103 0.8312472 0.8474812 0.8637735    0
## Ridge      0.7929892 0.8140624 0.8359233 0.8322962 0.8448253 0.8686858    0
## PCR        0.7651896 0.8019927 0.8200900 0.8147053 0.8299610 0.8454111    0
## PLS        0.7855012 0.8144389 0.8329310 0.8311162 0.8486403 0.8640585    0

When compring the different achieved models, we can say that:

  1. Regarding R-squared: in average, the best model is glm2: the general linear model that takes a lot of interactions and variables, and uses a gamma family for the target variable. This model can explain 88% of the total variability in the dataset. When chasing a simpler model, the general linear model with gamma family is pretty good too.

  2. Regarding MAE: in average the best model is also glm2: as it gets in average a mean absolute error in prediction of 3.292351. When checking the minimums, the model that gets the lowest MAE is the multiple regression with a lot of interactions and variables. When looking for a more parsimonious model, we can find that general linear model with all the variables but no interaction gets to (minimum) 2.71.

Combining Models

A good way of finding a better model from what has been done until now, is to combine some of them and check if by calculating the average of their predictions we get to a better result. Let´s try!

test_results$comb = (test_results$modboth + test_results$modboth2   +  test_results$lm2 +test_results$forwint)/4
postResample(pred = test_results$comb,  obs = test_results$Life.expectancy)
##      RMSE  Rsquared       MAE 
## 3.2752265 0.8852543 2.4989780

This is a fairly good model. The mean absolute error is one of the smallest, and is accompanied

Prediction Intervals

Now it´s time to calculate some intervals for the predictions (different than the regular confidence interval).

I will calculate these prediction intervals with my best linear model first, the one with many variables and interactions.

modelo <- lm(Model1,training)
predictions <- predict.lm(modelo, newdata=testing, interval="prediction", level=0.90)
head(predictions)
##         fit      lwr      upr
## 3  62.92031 57.60022 68.24040
## 8  59.78423 54.44552 65.12293
## 12 55.82064 50.17591 61.46537
## 17 76.08611 70.84194 81.33028
## 21 75.02596 69.78311 80.26882
## 23 73.91584 68.66886 79.16281
predictions=as.data.frame(predictions)
predictions$real = (test_results$Life.expectancy)
predictions = predictions %>% mutate(out=factor(if_else(real<lwr | real>upr,1,0)))
mean(predictions$out==1)
## [1] 0.0940919

We can see that somewhere around 10% of the real observations lay outside of the prediction interval. Let´s check it in the plot:

ggplot(predictions, aes(x=fit, y=real))+
  geom_point(aes(color=out)) + theme(legend.position="none") +
  xlim(10, 100) + ylim(10, 100)+
  geom_ribbon(data=predictions,aes(ymin=lwr,ymax=upr),alpha=0.3) +
  labs(title = "Prediction intervals", x = "prediction",y="real life expectancy")

As we can see, our prediction intervals are pretty good, as less than 10% (slightly) of the rel values are outside of the interval, and most of them are inside.

Let´s see what happens when we want to build prediction intervals for the model we just created by combining different ones.

yhat = (test_results$comb)
head(yhat)
## [1] 62.62510 59.90074 56.44643 76.29174 75.08380 73.92900
hist(yhat, col="lightblue")

As we can see, the distribution of the predictions looks kind of like a normal distribution, although it is a bit skewed. Let´s check the error in prediction:

y = (test_results$Life.expectancy)
error = y-yhat
hist(error, col="lightblue")

The error distribution is a bit more asymetric than the prediction distribution, but it also looks more centered (the median is close to zero). Now, I am going to split the testing set in two different parts. This is due to the fact that I can not use the same testing set to get the confidence interval. This way, I will compute with one part the noise, and with the other I will obtain the intervals. As I have 459 observations in my testing set, I will use the first 180 for the noise, and the rest to build the interval.

noise = error[1:180]
sd.noise = sd(noise)
hist(noise)

Now, I can use the rest of the observation in the testing set to build the interval. As we can see in the histogram above, the noise behaves very similar to a normal distribution. We can check with a Shapiro-Wilk test (normality test).

x.test <- shapiro.test(noise)
x.test
## 
##  Shapiro-Wilk normality test
## 
## data:  noise
## W = 0.98868, p-value = 0.1613

As we can see, the p-value of the test is higher than 0.05, meaning that we do not reject the null hypotesis of normality at a 95% of confidence level. I can proceed to build the intervals now. As we know, when a distribution behaves like a normal, 90% of the data is supossed to be inside 1.65 standard deviations. Like follows:

lwr = yhat[181:length(yhat)] - 1.65*sd.noise
upr = yhat[181:length(yhat)] + 1.65*sd.noise
pred <- yhat[181:length(yhat)]
head(cbind(lwr,upr,pred))
##           lwr      upr     pred
## [1,] 65.11959 75.53269 70.32614
## [2,] 66.95929 77.37238 72.16583
## [3,] 62.51743 72.93053 67.72398
## [4,] 60.57222 70.98532 65.77877
## [5,] 61.26118 71.67427 66.46772
## [6,] 74.00968 84.42278 79.21623

Here we see the first intervals of each value of prediction of life expectancy of a country.

Let´s do the same, now with a non-parametric way (we will get the same result):

lwr = yhat[181:length(yhat)] + quantile(noise,0.05, na.rm=T)
upr = yhat[181:length(yhat)] + quantile(noise,0.95, na.rm=T)
head(cbind(lwr,upr,pred))
##           lwr      upr     pred
## [1,] 65.01702 75.71331 70.32614
## [2,] 66.85672 77.55300 72.16583
## [3,] 62.41486 73.11115 67.72398
## [4,] 60.46965 71.16594 65.77877
## [5,] 61.15861 71.85489 66.46772
## [6,] 73.90711 84.60340 79.21623

Now we can check if there are any kind of anomalies (outliers), by checking first in a more conservative way, and then in a more extreme way. If there are outliers in the second plot, it means they are extreme outliers.

lwr = yhat[181:length(yhat)] + boxplot.stats(noise, coef=1.5)$stats[1]
upr = yhat[181:length(yhat)] + boxplot.stats(noise, coef=1.5)$stats[5]

y2 <- y[181:length(y)]
yhat2 <- yhat[181:length(yhat)]
df <- data.frame(y2,yhat2,lwr,upr)
head(df)
##     y2    yhat2      lwr      upr
## 1 73.9 70.32614 62.85375 77.50882
## 2 73.0 72.16583 64.69345 79.34852
## 3 71.9 67.72398 60.25159 74.90666
## 4 71.6 65.77877 58.30638 72.96145
## 5 71.0 66.46772 58.99534 73.65041
## 6 75.6 79.21623 71.74384 86.39891
df = df %>% mutate(out=factor(if_else(y2<lwr | y2>upr,1,0)))
mean(df$out==1)
## [1] 0.05415162

As we can see, only 4.6% of the predictions are outside the interval.

Now, this is even more robust:

lwr = yhat[181:length(yhat)] + boxplot.stats(noise, coef=3)$stats[1]
upr = yhat[181:length(yhat)] + boxplot.stats(noise, coef=3)$stats[5]

y2 <- y[181:length(y)]
yhat2 <- yhat[181:length(yhat)]
df <- data.frame(y2,yhat2,lwr,upr)

df = df %>% mutate(out=factor(if_else(y2<lwr | y2>upr,1,0)))
mean(df$out==1)
## [1] 0.01083032

Now, only 2% of the predictions are outside the interval. This is a very good interval then.